Why Many Benchmarks Might Be Compromised
Johannes Manner and Guido Wirtz
Distributed Systems Group
Otto-Friedrich-University
Bamberg, Germany
{johannes.manner, guido.wirtz}@uni-bamberg.de
Abstract—Benchmarking experiments often draw strong conclusions but lack information about environmental influences like the hardware used to deploy the investigated system. Fairness and repeatability of these benchmarks are at least questionable. Developing for or migrating applications to the cloud or DevOps environments often requires performance testing, either for ensuring quality-of-service or for choosing the correct service parameters when deciding on a cloud offering.
While building a benchmarking pipeline for cloud functions, the typical assumption is that a CPU scales its resources linearly with the requested utilization. Due to heat generation, noise and other constraints, and the resulting trade-off between efficiency and performance, this is not the case. To investigate this trade-off and its implications, we set up experiments to evaluate the influence of these factors on benchmark results. We solely focus on Intel CPUs. Beginning with the second generation (Sandy Bridge), Intel uses its own scaling driver intel_pstate. Our results show that different settings for this scaling driver have a significant impact on the measured performance and therefore on the linear regression models we computed using LINPACK benchmarks executed at different CPU utilization points. An active intel_pstate scaling driver with enabled turbo boost and the powersave governor reached an R² of 0.7349, whereas the performance governor showed a significantly better, ideal coefficient of determination of 0.9999 on a machine used in the benchmarks. Therefore, we propose a methodology for system calibration to ensure fair and repeatable benchmarks.
Index Terms—Intel pstate, Benchmark Design, Benchmarking,
Simulation, Profiling, Experiment Design, Repeatability
I. INTRODUCTION
Benchmarking is the process of understanding system qual-
ities and the basis to re-engineer closed source systems and
unravel their implementation mysteries. Often investigated
qualities of a system are the behavior under heavy load
and implicitly the scalability and resilience properties. To
understand the performance and runtime characteristics of
Systems under Test (SUT), benchmarking these systems is
an indispensable procedure. It is vital that experimenters
document all influencing factors modified in such a way
that other researchers/practitioners are able to interpret the
results correctly. As KUHLENKAMP and WERNER [1] stated
in a literature review about Function as a Service (FaaS)
benchmarks, only 12% of the experiments (3 out of 26) specify
all information to reproduce the experiment.
This problem of non-repeatable experiments gets even
worse, when a series of experiments does not state hypotheses
upfront and does not start with single, isolated experiments
to confirm or reject these hypotheses before conducting load
tests. Tools like JMeter1 allow stress testing a SUT, which can lead to predictions on how the system will behave under heavy load. CPU and other hardware resources are strained, but often the implications of the hardware used are neglected [2]–[4].
Since these SUT are running on different hardware within
the software lifecycle, the runtime behavior may diverge
which is important when configuring one system based on
the measured quality of service of another system. Dev and
prod environments, for example, rarely run on the exact
same hardware. General purpose processors, i. e., consumer
processors, focus on optimizing the average-case performance
by employing runtime performance enhancements. However,
these techniques are normally not documented since they are
Intellectual Property (IP) of the vendor. End users have limited
control over them. Frequency scaling of the CPU is influenced
by a lot of factors specified in the Advanced Configuration and
Power Interface (ACPI) specification [5]. In particular, perfor-
mance states (P-states), cooling requirements and turbo boost
options influence the frequency scaling, power consumption
and heat generation of the CPU. However, for load tests a
linear scaling of resources is important for interpretable and
fair results.
In a previous experiment, we faced some unpredictable and
non linear performance distributions on two Intel i7 proces-
sors. One of our machines showed three performance ranges
due to power saving aspects. However, there are many settings
influencing each other and ultimately the performance of the
CPU making it hard to trace performance variations back to
a single factor. DELL's configuration of High Performance Computing (HPC) servers [6], [7] or the SPEC BIOS settings descriptions2 for their CPU benchmarks are examples of this. Simply disabling all power-saving or performance boost
options is not an option since the default settings do not
ensure linear scaling of resources. It is likely that other investigations also face this non-linear performance distribution on machines used in their experiments without detecting it due
to noise in the benchmarking data.
Therefore, in this paper we propose a simple function
to calibrate hardware and understand the scaling algorithms
determining CPU performance in order to allow for fair
benchmarks. This concept is also usable for later processors
1https://jmeter.apache.org/
2https://www.spec.org/cpu2017/flags/Intel-Platform-Settings-V1.1.html
of the Intel i-series or Xeon processors which - dependent
on the model - use intel_pstate as their scaling driver.
Since a lot of factors influence the CPU frequency scaling at
different CPU utilization points, an understanding of runtime
characteristics is necessary in order to compare different
systems. For fair experiments, we assume a linear scaling when
increasing the system resources, particularly the CPU access
time of a processor. An implementation of our simple hardware
calibration functionality is available as a CLI feature3. A run-
ning Docker environment is the only prerequisite to execute
the calibration. Our benchmark can serve as a standard for
normalizing results of benchmark experiments.
This leads to the following research question:
RQ: How can a consistent CPU scaling behavior across
different processors be achieved and made visible to
ensure the validity and comparability of benchmark ex-
periments?
We limit the discussion to Linux4 and Intel CPUs5 since
they are the predominant technology used in the cloud. The
agenda is as follows: In Section II we recap some important
fundamentals for P-states and the CPU frequency scaling
options in the Linux kernel. The next Section discusses
related work of benchmark characteristics and how P-states
were addressed in other research. Section IV describes the
motivation of our work in greater detail and continues with
the presentation of the problem. The next Section proposes
our methodology to answer RQ followed by an evaluation.
The discussion of the results and limitations of our work is
presented in Section VII. Finally, we conclude our paper with
ideas and next steps for future work.
II. FUNDAMENTALS
According to the ACPI specification6, Control States (C-
states) and Performance States (P-states) determine power
consumption and frequency of the CPU. C-states handle the
working mode of the machine, e.g. active or suspended. Since
we assume an active one, C-states will not be considered in the
following. P-states determine the CPU frequency and therefore
the computational power. They can be changed by algorithms
when demand changes. There are a few conflicting goals which need to be addressed when changing P-states [5]: performance on the one hand, and power consumption, battery life, thermal constraints and fan noise on the other.
The Linux kernel CPU Frequency scaling (CPUFreq) sub-
system consists of three components7:
Scaling Governors - Each scaling governor implements an algorithm for estimating the CPU demand and changes the processors' frequency accordingly. Also mixed strategies where different scaling governors work together are reasonable to achieve a good system performance under various loads. Examples for governors are performance (highest frequency) or powersave (lowest frequency).
Scaling Drivers - "Provide scaling governors with information on the available P-states (or P-state ranges in some cases) and access platform-specific hardware interfaces to change CPU P-states as requested by scaling governors."8
CPUFreq Core - Basic code infrastructure framework which the other two components integrate with.
3https://github.com/johannes-manner/SeMoDe/releases/tag/v0.4
4Market share Linux: https://www.rackspace.com/en-gb/blog/realising-the-value-of-cloud-computing-with-linux (last accessed 2021-04-16)
5Market share Intel: https://www.statista.com/statistics/1130315/worldwide-x86-intel-amd-laptop-market-share/ (last accessed 2021-04-16)
6Especially Section 8 in the specification is of particular interest (pages 509ff. in [5])
7https://www.kernel.org/doc/html/v5.4/admin-guide/pm/cpufreq.html
The options listed here are the implementation in the Linux
kernel. Since the scaling driver communicates with the hard-
ware, vendor-specific options cannot be addressed by a generic
implementation. Therefore, the implementation of vendor-
specific scaling drivers was introduced. Since Sandy Bridge (the second generation of Intel Core processors), intel_pstate is such a scaling driver, making it possible to implement custom scaling governors or to overwrite existing ones. This driver circumvents the generic implementations and also adds new features: Hardware-Managed P-states (HWP) enable customized scaling algorithms that deal with the specialties of each processor family and model. When HWP is turned off, the generic scaling information specified in the ACPI tables is used.
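These settings can be inspected at runtime through the cpufreq sysfs interface documented in the kernel admin guide cited above. The following minimal Python sketch is an illustration, not part of the paper's prototype; the intel_pstate files only exist when that driver is loaded, other files depend on the active driver and kernel version.

from pathlib import Path

CPU0 = Path("/sys/devices/system/cpu/cpu0/cpufreq")
PSTATE = Path("/sys/devices/system/cpu/intel_pstate")

def read(path: Path) -> str:
    # Not every file exists for every driver/kernel combination.
    return path.read_text().strip() if path.exists() else "n/a"

print("scaling driver  :", read(CPU0 / "scaling_driver"))
print("scaling governor:", read(CPU0 / "scaling_governor"))
print("intel_pstate    :", read(PSTATE / "status"))    # active, passive or off
print("turbo disabled  :", read(PSTATE / "no_turbo"))  # "1" = turbo boost off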
III. RELATED WORK
A. Benchmark Characteristics
One of the first publications about benchmark characteristics
is the work of HUPPLER [8]. He states that a good bench-
mark design needs to be relevant, repeatable, fair, verifiable
and economical. As already mentioned in the introduction,
especially the repeatability and therefore also the fairness of
some experiments are difficult to ensure. Missing information
about the configuration, the number of executions or the
load distribution diminishes the confidence in an experiment
and its results. Also, the equivalence of the benchmarking
environment and the production environment, often called dev-
prod parity, is seldom achieved and introduces another source
of discrepancy.
Other publications also introduce additional benchmark
characteristics for the cloud, e.g. COOPER et al. [9] and BERMBACH et al. [2]. The latter especially mentions the underlying hardware stack, whereas COOPER et al. introduce
portability, scalability and simplicity as characteristics. For the
cloud in particular, different CPU architectures are present [3],
[4] which emphasizes the importance of transferable results of
benchmarks and the need to compare physical resources.
B. Experiments with intel_pstate
Since the scaling of CPU resources determines the CPU
frequency and therefore the speed, we cluster previous re-
search by their use of intel_pstate if this configuration
8https://www.kernel.org/doc/html/v5.4/admin-guide/pm/cpufreq.html#cpu-performance-scaling-in-linux
is explicitly mentioned. The Linux kernel has supported this scaling driver since version 3.9, released in 2013 [10]. As the kernel documentation describes9, how a P-state is translated into a frequency also depends on the specific processor model and family. Energy consumption grows proportionally with the frequency; therefore, ACPI-compliant low-power solutions are also researched for the cloud [11].
Overall, there are four configuration options10. Option one
and two are to use intel_pstate scaling driver in active
mode with (1) or without (2) hardware support (no_hwp).
The third option is to use it in passive mode (3), whereas
the last option is to disable intel_pstate (4). Disabling
it results in the usage of the generic acpi-cpufreq scaling
driver. Some vendor-specific hardware properties, which can
be read by the intel_pstate scaling driver, cannot be
used in this case. There are several reasons for disabling it. BECKER and CHAKRABORTY [12] fixed the CPU performance of their
system by disabling this feature and also the turbo boost
options. Reducing noise for a particular use case was another
reason to disable it [13]. Some researchers, e.g. [14], wanted
to be more flexible in changing the frequency by hand during
their benchmarks. Since some of the generic scaling governors
are overwritten (by using the same name) by intel_pstate
and others are not usable, sometimes researchers [15], [16]
disabled it to use the generic ones. We refer the interested
reader to the Linux kernel documentation. None of the papers we identified by searching Google Scholar and the ACM Digital Library specifies one of the first three options (1-3) explicitly11.
Since the default configuration is an active
intel_pstate (for some models also with HWP enabled)
we assume that a lot of benchmarks use this configuration
when doing experiments on-premise.
IV. PROBLEM ANALYSIS
While working on our simulation and benchmarking
pipeline proposed in [17], we have seen the performance
behavior shown in Figure 1 when executing our calibration
function.
H60 and H90 are Intel quad-core machines with Ubuntu
20.04.2 as the OS and Docker for executing the function
seen in Figure 1. The machines’ specifications can be found
in Table I. We specified the cpus12 Docker CLI option to
limit the CPU usage by the containers. At each point in time,
only a single container is running on the mentioned machines
9https://www.kernel.org/doc/html/v5.4/admin-guide/pm/intel_pstate.html#processor-support
10In the /etc/default/grub file, the GRUB_CMDLINE_LINUX_DEFAULT property can be changed to a value explained in the kernel documentation (https://www.kernel.org/doc/html/v5.4/admin-guide/pm/intel_pstate.html#kernel-command-line-options-for-intel-pstate). Via sudo update-grub, these changes can be applied and the system can be rebooted.
11The search term intel pstate resulted in ten records in the ACM Digital Library and 97 on Google Scholar. The result list on Google Scholar contained a lot of presentations and also links to the Linux kernel documentation. All conference and journal publications identified as relevant were included in this paragraph.
12https://docs.docker.com/config/containers/resource_constraints/#cpu
[Figure 1 plots: left panel H60, right panel H90 − 1.a; x-axis: CPU Quota (0–4), y-axis: GFLOPS]
Fig. 1. Calibrating local Machines for Benchmarking.
TABLE I
SPECIFICATIONS OF THE TWO MACHINES OF THE SHOWN EXPERIMENTS.
                 H60          H90
Processor        i7-2600      i7-7700
Model            42           158
Base Frequency   3.40 GHz     3.60 GHz
Turbo Boost      3.80 GHz     3.90 GHz
Linux Kernel     5.4.0-65     5.4.0-70
together with our prototype which collects the metrics. The
impact of our prototype on the CPU utilization is negligible.
We measured it using the sar command in three system
states: when no function is running, the prototype starts and
the prototype idles. We did not see a noteworthy deviation from the clean system state13.
We executed LINPACK [18], [19] which solves linear
equations as a CPU intensive function packaged in a Docker
container at runtime. Each setting present in Figure 1 was exe-
cuted 25 times by increasing the Docker cpus option by 0.1.
Both machines use intel_pstate in active mode as their
scaling driver and powersave as the scaling governor. HWP
is enabled on H90 and not available on H60. At each share of
the CPU, e.g. 0.5 cpus, the assigned portion of the CPU is
nearly fully utilized due to the LINPACK characteristics. This
is important to keep in mind when interpreting the diagrams
and the subsequent results. In other words, we mimic artificial
13Look at the following file which shows the utilization and changes when starting the prototype: https://github.com/johannes-manner/SeMoDe/files/6336159/utilization.txt.
situations where a defined portion of the system is under heavy
load and look at the performance of our system. For example
when assigning 0.5 cpus on a system with four cores, the
CPU utilization is around 12.5%14.
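The mapping from a Docker CPU quota to the overall utilization used in this interpretation is simply the quota divided by the number of cores. A tiny illustrative helper (our own sketch, assuming the four physical cores of H60/H90 and ignoring hyper-threading):

# Illustrative helper (not part of the paper's prototype): expected overall CPU
# utilization in percent for a given Docker --cpus quota on a `cores`-core machine.
def expected_utilization(cpu_quota: float, cores: int = 4) -> float:
    return cpu_quota / cores * 100.0

print(expected_utilization(0.5))  # 12.5, the example from the text above
print(expected_utilization(4.0))  # 100.0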
The situation described here is contrived since in a normal
load test, the impact of the frequency scaling might be hidden
within the noise of other influencing factors. The CPU is
normally not executed long enough at a given utilization to
see the phenomenon in the data. Therefore, it is necessary to
create a testbed where we can assess this influencing factor in
isolation and make changes in the configuration visible.
As mentioned before, for laptops or machines with cooling
problems etc., the choice of the scaling driver impacts the
power consumption and heat generation. Furthermore, even
different models within the same generation of a processor line
have an impact on the frequency scaling due to their specific
hardware support for HWP. Being aware of this scaling
phenomenon makes it easier for experimenters to choose a
suitable scaling behavior for their benchmarks. This enables a
performance estimation under low, moderate and high load of
a system and does not jeopardize the results and, hence, the conclusions drawn.
TABLE II
LINEAR REGRESSION MODELS FOR DATA PRESENTED IN FIGURE 1
             H60          H90 - 1.a
p-value      <2.2e-16     <2.2e-16
R²           0.9995       0.7349
Intercept    -3.081       -50.340
Slope        23.400       56.357
Max GFLOPS   90.905       215.818
These problems are now expressed in numbers. The orange
lines in Figure 1 show the linear regressions. Table II shows
the statistics to the figures for H60 and H90. The coefficient
of determination (R²) for H60 shows a near ideal relation-
ship between the dependent Giga Floating Point Operations
per Second (GFLOPS) and the independent cpus variable.
Therefore, the results of a benchmark on this machine are
comparable and fair since doubling the resources results in
doubling the GFLOPS. The intercept is negligible in this case and explainable by inherent computational overhead. In contrast, on H90, the relationship between cpus and GFLOPS is still good with an R² of 0.7349, but it is obvious when looking at
Figure 1 (right) that three performance ranges are visible from
[0, 0.5], [0.6, 2.8] and [3.0, 4.0] with different slopes. When
checking the available governors at H90, powersave and
performance are active. The kernel documentation states
that the ”processor is permitted to take over performance
scaling control”15 when exceeding a threshold. When further
looking at the different CPUs and their frequencies at runtime
14Due to other processes running on the system, the utilization is a bit
higher, but shared services running in the background are negligible as can
be seen for the prototype influence measured via sar.
15https://www.kernel.org/doc/html/v5.4/admin-guide/pm/intel_pstate.html#turbo-p-states-support
via tools like turbostat16, we can see that the powersave
scaling governor is used for the second interval operating at
minimum frequency and the performance scaling governor
is used for the first and third interval. Therefore, a fair
comparison of SUT deployed on H60 and H90 is questionable
since H90 performs worse under moderate load than under
peak load.
This observation and the statistical evaluation already em-
phasize the need for a calibration of the CPU performance.
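The regression metrics reported in Table II (slope, intercept, R²) can be recomputed from the raw (CPU quota, GFLOPS) pairs with an ordinary least-squares fit. The sketch below uses only the Python standard library (3.10+) and hypothetical sample points; the paper's numbers stem from the full 25-point series per machine.

import statistics

# Hypothetical (Docker CPU quota, measured GFLOPS) pairs; a real calibration
# run would use the 25 measurements taken in steps of 0.1 cpus.
quotas = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
gflops = [10.2, 21.5, 32.9, 44.1, 55.8, 66.7, 78.0, 89.3]

slope, intercept = statistics.linear_regression(quotas, gflops)
r_squared = statistics.correlation(quotas, gflops) ** 2  # R² of a simple linear fit

print(f"slope={slope:.3f}  intercept={intercept:.3f}  R^2={r_squared:.4f}")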
V. METHODOLOGY
We propose the following solution to RQ. We use the
LINPACK benchmark as a CPU intensive calibration function
and report the metrics specified in Table II to the user of our
research prototype’s CLI17. Even though the case described
in the previous section is contrived, it gives us the option to
isolate CPU performance and to make changes in the con-
figuration of intel_pstate visible. To be able to restrict
CPU resources to a single function, we use the Docker CLI
cpus option. This gives us the chance to artificially fix the
CPU utilization at a given value and understand the scaling
of the CPU frequency and the performance by looking at the
computed GFLOPS.
LINPACK is especially suited to assess the performance of multi-core hardware since it makes use of all available CPUs and is machine independent [18]. The same holds true
for load testing tools, where concurrent users of a system
can be simulated to stress the SUT. Other functions using the
available resources in a similar way are also possible for this
proposed calibration step, but LINPACK is well established in
this domain. An excerpt of the LINPACK output is shown in
Listing 1. We run our Docker container with a CPU share of
1.0 cpus on H90 in this example.
Listing 1. Sample LINPACK execution on H90 for a CPU share of 1.0.
> docker run --cpus=1.0 jmnnr/linpack:v1
...
Intel(R) Optimized LINPACK Benchmark data
Current date/time: Fri Apr 16 11:10:52 2021
...
============ Single Runs ============
Size    LDA     Align.  Time(s)   GFlops
1000    1000    4       0.005     144.8319
1000    1000    4       0.095     7.0746
...
10000   25000   4       58.080    11.4819
10000   25000   4       58.289    11.4406
Performance Summary (GFlops)
Size    LDA     Align.  Average   Maximal
1000    1000    4       87.1978   144.8319
5000    18000   4       10.9034   10.9638
10000   25000   4       11.4612   11.4819
Residual checks PASSED
...
16https://www.linux.org/docs/man8/turbostat.html
17https://github.com/johannes-manner/SeMoDe/releases/tag/v0.4
The Single Runs and Performance Summary sections present
the size of the matrix which is used for the linear computation
and the leading dimension of A (LDA) which also determines
the storage of arrays in memory. What is interesting in the
Single Runs section of the output is the problem size of the
linear equation system. Problem size 1’000 reached quite a
high number of GFLOPS. At this point in time the frequency
scaling of the CPU is not stable and also the equations for this
problem size are executed within a few milliseconds which
distorts the accuracy of the CPU performance measurement in
GFLOPS. In addition, the problem sizes are executed repeat-
edly for more stable results as can be seen for the two runs of
problem size 10’000. Therefore, we use the average GFLOPS
of the experiment with the largest problem size because the
equations in this case run a sufficient period of time to get a
stable scaling under this portion of CPU utilization. For the sake of simplicity and comparability, we package the LINPACK function and push the image18 to Docker Hub, which is used by our prototype by default. We further parse the results (Listing 1) of LINPACK to get the GFLOPS values.
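A minimal sketch of this calibration loop is given below. It assumes the jmnnr/linpack:v1 image and the output format shown in Listing 1 as well as a local Docker installation; the published prototype implements this step as a CLI feature, so the sketch only illustrates the idea.

import re
import subprocess

IMAGE = "jmnnr/linpack:v1"  # LINPACK image referenced in Listing 1

def linpack_avg_gflops(cpu_quota: float) -> float:
    """Run LINPACK under a fixed Docker CPU quota and return the average
    GFLOPS of the largest problem size (assuming the format of Listing 1)."""
    out = subprocess.run(
        ["docker", "run", "--rm", f"--cpus={cpu_quota}", IMAGE],
        capture_output=True, text=True, check=True,
    ).stdout
    summary = out.split("Performance Summary", 1)[1]
    # Summary rows: size, LDA, alignment, average GFLOPS, maximal GFLOPS.
    rows = re.findall(r"^\s*(\d+)\s+\d+\s+\d+\s+([\d.]+)\s+[\d.]+\s*$",
                      summary, flags=re.MULTILINE)
    size, avg = max(rows, key=lambda row: int(row[0]))
    return float(avg)

if __name__ == "__main__":
    # Sweep the quota from 0.1 to 4.0 cpus in steps of 0.1 as in the paper.
    for quota in (round(0.1 * i, 1) for i in range(1, 41)):
        print(quota, linpack_avg_gflops(quota))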
VI. EVALUATION
In our evaluation, we look at the four cases introduced
in related work. We solely focus on H90 since H60 shows
an already acceptable CPU performance distribution under
diverse load settings. For option one, we further investigate
sub-cases to be more precise in drawing conclusions on this
specific machine and show the most important settings.
1) Scaling driver intel_pstate in active mode with
HWP support.
a) Turbo boost on, powersave scaling governor.
b) Turbo boost off19, powersave scaling governor.
c) Turbo boost on, performance scaling governor.
d) Turbo boost off, performance scaling governor.
2) Scaling driver intel_pstate in active mode without
HWP support, powersave scaling governor20.
3) Scaling driver intel_cpufreq since
intel_pstate is in passive mode. Scaling governor
is ondemand21.
4) Scaling driver acpi-cpufreq since intel_pstate is disabled. Scaling governor is ondemand22.
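Which of these boot-time options is actually in effect can be checked before a calibration run. A small sketch, assuming the kernel command-line values from the footnotes and the sysfs status attribute of intel_pstate:

from pathlib import Path

cmdline = Path("/proc/cmdline").read_text().strip()
status = Path("/sys/devices/system/cpu/intel_pstate/status")

print("kernel cmdline:", cmdline)
if "intel_pstate=disable" in cmdline:
    print("option 4: intel_pstate disabled, acpi-cpufreq expected")
elif "intel_pstate=passive" in cmdline:
    print("option 3: passive mode, intel_cpufreq driver expected")
elif "intel_pstate=no_hwp" in cmdline:
    print("option 2: active mode without HWP")
else:
    print("option 1: default active mode (HWP if supported)")
if status.exists():
    print("intel_pstate status:", status.read_text().strip())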
For each setting (except for 1.a), we have only a single exe-
cution. In our methodology, we propose that by default only a single run is needed to assess the quality of a system. Especially
the active HWP case is investigated further by changing the
scaling governor to performance and enabling/disabling
turbo boost. Options 2 to 4 are investigated in the default
setting when updating grub, so turbo boost is enabled in all
of these cases. The input for LINPACK is constant for all
executions with three different matrix sizes (1’000, 5’000 and
18https://hub.docker.com/repository/docker/jmnnr/linpack
19Changing /sys/devices/system/cpu/intel_pstate/no_turbo to "1" disables the turbo boost; "0" indicates an enabled turbo boost.
20GRUB_CMDLINE_LINUX_DEFAULT="intel_pstate=no_hwp"
21GRUB_CMDLINE_LINUX_DEFAULT="intel_pstate=passive"
22GRUB_CMDLINE_LINUX_DEFAULT="intel_pstate=disable"
10’000) as can be seen in Listing 1. We use the average of
the highest problem size (10’000) for the statistical evaluation
and the figures presented in the following.
The first one (1.a) was already shown in Figure 1 (right)
and evaluated in Table II. The CPU is configured with active
HWP and turbo boost23.
[Figure 2 plots: panels H90 − 1.b, H90 − 1.c, H90 − 1.d; x-axis: CPU Quota (0–4), y-axis: GFLOPS]
Fig. 2. Calibrating H90 in different settings by changing scaling governor
and turbo boost.
TABLE III
LINEAR REGRESSION MODELS FOR DATA PRESENTED IN FIGURE 2
             H90 - 1.b    H90 - 1.c    H90 - 1.d
p-value      6.6e-16      <2.2e-16     <2.2e-16
R²           0.7406       0.9999       0.9999
Intercept    -45.203      -1.820       -1.715
Slope        51.649       54.118       49.389
Max GFLOPS   197.101      215.939      196.908
The other three sub-configurations under option 1 are inves-
tigated in Table III and Figure 2. In 1.b we turned off turbo
boost, resulting in the same distribution but the maximum
achieved GFLOPS is around 10% lower, which is reasonable
when looking at the base clock rate of 3.6 GHz and 3.9 GHz
in turbo boost mode. The same observation can be made when
comparing 1.c and 1.d with each other. The distributions are
equal except for the absolute value of GFLOPS.
The difference between 1.a/1.c and 1.b/1.d respectively is
the scaling governor used. We exchanged the powersave
23https://www.intel.com/content/www/us/en/architecture-and-technology/turbo-boost/turbo-boost-technology.html
with the performance governor24. For the performance
governor, switching from one to the other algorithm does not
happen since the CPU is operated at maximum frequency.
Therefore, this configuration is a candidate for doing fair and
repeatable benchmarks on H90. The drawback is that the power consumption is also at its maximum, resulting in additional heat generation.
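The switches between the sub-configurations 1.a-1.d were done via the sysfs file from footnote 19 and the cpufreq-set call from footnote 24; an equivalent, purely illustrative Python sketch (root privileges required, intel_pstate in active mode assumed) could look like this:

from pathlib import Path

def set_turbo(enabled: bool) -> None:
    # no_turbo = "1" disables turbo boost, "0" enables it (cf. footnote 19).
    Path("/sys/devices/system/cpu/intel_pstate/no_turbo").write_text("0" if enabled else "1")

def set_governor(governor: str) -> None:
    # Same effect as `cpufreq-set --cpu n --governor <governor>` for every CPU.
    for f in Path("/sys/devices/system/cpu").glob("cpu[0-9]*/cpufreq/scaling_governor"):
        f.write_text(governor)

# Example: configuration 1.d (performance governor, turbo boost off).
set_governor("performance")
set_turbo(False)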
[Figure 3 plots: panels H90 − 2, H90 − 3, H90 − 4; x-axis: CPU Quota (0–4), y-axis: GFLOPS]
Fig. 3. Calibrating H90 in different settings by changing driver.
TABLE IV
LINEAR REGRESSION MODELS FOR DATA PRESENTED IN FIGURE 3
             H90 - 2      H90 - 3      H90 - 4
p-value      <2.2e-16     <2.2e-16     <2.2e-16
R²           0.9953       0.9975       0.9976
Intercept    -14.378      -7.169       -7.775
Slope        55.362       54.254       54.395
Max GFLOPS   215.835      215.880      215.740
Options 2 to 4 are graphically presented in Figure 3 and
statistically in Table IV. Compared to 1.a, the only difference
of option 2 is an active intel_pstate without HWP
support. The HWP support has an impact on the scaling
algorithm and enables switching between powersave and
performance governor as already seen in Figure 1 (right),
whereas the system configured without HWP only uses the
powersave governor which ”selects P-states proportional to
24sudo cpufreq-set --cpu n --governor performance, where n is the processor number
the current CPU utilization”25 in this operation mode. This
results in the undulations seen in Figure 3 (top).
Passivating intel_pstate (option 3) results in the usage
of the intel_cpufreq scaling driver and the ondemand
scaling governor. As stated in the documentation, the HWP
support is also disabled. Compared to the second option, the
performance behavior is quite similar; however, the governor uses the CPU load to determine the CPU frequency. Only 16 fixed P-states are provided by the generic ACPI frequency table, which explains the non-linear scaling behavior.
Most researchers disable intel_pstate and use the
generic acpi-cpufreq scaling driver. Since option 3 and
4 use the same governor and the information available in the
ACPI tables, their results are similar.
As for the first option, fine-tuning for the other options is
necessary to reach a good and stable scaling. The presented
differences for the latter three options are only a starting point.
VII. CONCLUSION
A. Discussion of the Results
The presented methodology enables a calibration of systems
with respect to the CPU performance. We use LINPACK as
a machine independent benchmark to assess the frequency
scaling of the CPU and express the power via GFLOPS
when solving linear equations. The approach showed one
solution to solve the initially motivated situation, where an
unpredictable scaling was present. We investigated the most
important influencing factors for Intel CPUs under Linux
namely the scaling driver and its corresponding governors.
Due to vendor-specific implementations, it is possible to make use of vendor-specific knowledge about the system components, as intel_pstate showed. Nevertheless, in
some configurations, this leads to performance distributions
which are questionable when conducting fair and repeatable
benchmarks. The four options identified in related work cover
only a small portion of the possible system configurations when taking all settings into consideration.
Why many researchers disable intel_pstate and use the generic acpi-cpufreq scaling driver is not discussed explicitly in their work, nor does it seem reasonable when looking at the results of our evaluation. Therefore, it is important
to conduct calibration experiments such as ours upfront to find
a good configuration for benchmarking SUT. Only option 1
was examined in detail by changing the scaling governor and
testing the system with turbo boost enabled and disabled but
without other fine-tuning.
Figure 4 shows how important such a linear scaling is for
dev-prod parity considerations where a LINPACK benchmark
was executed on AWS Lambda (region eu-central-1) 100 times
for different memory settings on Intel Xeon processors with
2.50 GHz, model 63. This cloud platform shows a linear
scaling with an R² of 0.9973 (Intercept: -1.995, Slope: 0.0197, Max. GFLOPS: 209.757) and therefore guarantees a stable
25https://www.kernel.org/doc/html/v5.4/admin-guide/pm/intel_pstate.html#powersave
[Figure 4 plot: Calibration on AWS Lambda; x-axis: CPU Quota (memory setting, 0–10 000), y-axis: GFLOPS]
Fig. 4. Executing LINPACK on AWS Lambda for different memory settings.
quality of service. Without a similar performance distribution locally, we cannot draw conclusions about how software will run on other machines or in the cloud.
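Given the reported regression, the expected LINPACK performance for a memory setting follows directly from intercept and slope; a short worked example with the numbers above (rounded, purely illustrative):

# Linear model reported for AWS Lambda (Figure 4): GFLOPS ≈ intercept + slope * memory_MB
intercept, slope = -1.995, 0.0197

def expected_gflops(memory_mb: int) -> float:
    return intercept + slope * memory_mb

print(round(expected_gflops(2048), 1))   # ≈ 38.4 GFLOPS at a 2048 MB setting
print(round(expected_gflops(10240), 1))  # ≈ 199.7 GFLOPS, close to the measured maximum of ~209.8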
B. Threats to Validity
Single Machine - We only looked in detail at a single Intel
processor (i7-7700, model 158) in this paper and assessed
its specific configuration. For further generations of Intel
processors (second generation onwards), intel_pstate is used as the scaling driver and we assume that the behavior shown here is also present on some of these machines, depending on the model and processor line.
Single Vendor - We only looked at Intel processors, but we
assume that our methodology also works on other processors
such as AMD's.
Minimal Dimension - CPU performance was the only dimension of interest in this paper. To draw strong
conclusions, we tried to reduce the influence of all other
factors to a minimum. We are aware that the CPU performance
is also influenced by cooling requirements, network access,
hard disk speed, system bus, etc.
Sample Size - As mentioned in the evaluation, we only
executed a single run for each experimental setup since we also
propose to do this when using our methodology in practice. We
have seen in Figure 1 that especially the transition from the
second to the third performance interval is interesting since
the influences of the scaling algorithm switch can be seen
in greater detail for the 25 executions. There is a trade-off
between execution time and accuracy of the results. However,
in Figure 2 (top) we have seen that even a single execution is enough to reveal non-linear scaling.
LINPACK Configuration - The problem size, number
of runs and the LDA of the LINPACK benchmark can be
specified as input. All calibration runs in this paper were
executed with the same set of parameters, which could bias the
results since the LDA also determines how the matrix A containing the linear equations is stored in the heap.
VIII. FUTURE WORK
Due to the various threats mentioned, there is a lot to do
in future work to enable an even fairer setup for benchmarks.
Our plan for future work is twofold.
Firstly, we want to look at other factors related to the CPU
and influencing the performance. First and foremost, we want
to look at memory influences by changing the LINPACK input
parameters. Also system bus capabilities and other components
play a vital role which might require some specific calibration
considerations to assess the quality of a system. Furthermore,
the cooling capabilities are important to keep in mind when
operating the system with the performance governor and
in turbo boost mode. Finally, different Intel processors and other vendors are the focus of the next research step.
Secondly, we want to implement some visual support to
generate figures as presented in this paper by our prototype.
Currently, the support is limited to the statistical evaluation by
getting the intercept, slope and R².
REFERENCES
[1] J. Kuhlenkamp and S. Werner, “Benchmarking FaaS Platforms: Call for
Community Participation,” in Proc. of WoSC, 2018.
[2] D. Bermbach et al., Cloud Service Benchmarking. Springer Interna-
tional Publishing, 2017.
[3] J. O’Loughlin and L. Gillam, “Performance evaluation for cost-efficient
public infrastructure cloud use,” in Proc. of GECON, 2014.
[4] R. Cordingly et al., “Predicting performance and cost of serverless
computing functions with SAAF,” in Proc. of DASC/PiCom/CBDCom/-
CyberSciTech, 2020.
[5] UEFI Forum, "ACPI specification, version 6.3," UEFI Forum, Inc., Tech. Rep., 2019. [Online]. Available: https://uefi.org/sites/default/files/resources/ACPI_6_3_final_Jan30.pdf
[6] J. Beckett, “Bios performance and power tuning guidelines for dell
poweredge 12th generation servers,” DELL, Tech. Rep., 2012.
[7] G. Kocchar et al., “Optimal bios settings for hpc with dell poweredge
12th generation servers,” Citeseer, Tech. Rep., 2012.
[8] K. Huppler, “The Art of Building a Good Benchmark,” in Proc. of
TPCTC, 2009.
[9] B. F. Cooper et al., “Benchmarking cloud serving systems with YCSB,”
in Proc. of SoCC, 2010.
[10] C. Gough, I. Steiner, and W. Saunders, “Operating systems,” in Energy
Efficient Servers. Apress, 2015, pp. 173–207.
[11] M. Karpowicz et al., “Energy and power efficiency in cloud,” in
Computer Communications and Networks. Springer International
Publishing, 2016, pp. 97–127.
[12] M. Becker and S. Chakraborty, “Measuring software performance on
linux,” arXiv e-Prints - 1811.01412, 2018.
[13] J. Dorn et al., “Automatically exploring tradeoffs between software
output fidelity and energy costs,” IEEE Transactions on Software Engi-
neering, vol. 45, no. 3, pp. 219–236, 2019.
[14] E. Calore et al., “Software and DVFS tuning for performance and
energy-efficiency on intel KNL processors," Journal of Low Power
Electronics and Applications, vol. 8, no. 2, p. 18, 2018.
[15] A. Rumyantsev et al., “Evaluating a single-server queue with asyn-
chronous speed scaling,” in Lecture Notes in Computer Science.
Springer International Publishing, 2018, pp. 157–172.
[16] M. Horikoshi et al., “Scaling collectives on large clusters using intel(r)
architecture processors and fabric,” in Proc. of HPC Asia, 2018.
[17] J. Manner, “Towards Performance and Cost Simulation in Function as
a Service,” in Proc. of ZEUS, 2019.
[18] J. J. Dongarra et al., LINPACK Users' Guide. Society for Industrial
and Applied Mathematics, 1979.
[19] ——, “The linpack benchmark: past, present and future,” Concurrency
and Computation: Practice and Experience, vol. 15, no. 9, pp. 803–820,
2003.