
Hardware Counted Profile-Guided Optimization



Profile-Guided Optimization (PGO) is an excellent means to improve the performance of a compiled program. Indeed, the execution path data it provides helps the compiler to generate better code and better cacheline packing. At the time of this writing, compilers only support instrumentation-based PGO. This approach has proved effective for optimizing programs. However, few projects use it, due to its complicated dual-compilation model and its high overhead. Our solution, based on sampling Hardware Performance Counters, overcomes these drawbacks. In this paper, we propose a PGO solution for GCC that samples Last Branch Record (LBR) events and uses debug symbols to recreate the source locations of binary instructions. With LBR sampling, the generated profiles are very accurate. This solution achieved an average of 83% of the gains obtained with instrumentation-based PGO, and 93% on C++ benchmarks only. The profiling overhead is only 1.06% on average, whereas instrumentation incurs a 16% overhead on average.
Baptiste Wicht (EIA-FR)
Roberto A. Vitillo
Dehao Chen (Google)
David Levinthal (Google)
Keywords: Profile-Guided Optimization, Sampling, Hardware Performance Counters, Compilers
1 Introduction
Profilers help developers and compilers find the main areas for optimization. Profiling and optimizing by hand is a time-consuming process, but this process can be automated. Modern compilers include an optimization technique called Profile-Guided Optimization (PGO). Feedback-Directed Optimization (FDO) is also used as a synonym of PGO.
PGO uses information collected during the execution of a program to optimize it. Generally, edge execution frequencies between basic blocks are collected. Several optimization techniques can take advantage of the collected profile. For instance, the data can be used to drive inlining decisions and block ordering within a function to achieve minimal cacheline usage. Branches can be reordered based on their frequency to avoid branch misprediction. Loops working on arrays that cause data cache misses can be improved to make better use of the cache. As the dynamic profile, unlike the static profile, captures execution frequency, this can result in impressive speedups for non-I/O-intensive programs.
Compilers currently support instrumentation-based PGO. In this variant, the compiler must first generate a special version of the application in which instructions are inserted at specific locations to generate the profile. During execution, these instructions increment counters and, at the end, the profile is written to a file. After that, the program is compiled again, this time with PGO flags, to use the profile and optimize the binary for the final version.
This approach has several drawbacks:
• The instructions inserted into the program slow it down. An instrumented binary can be much slower than its optimized counterpart. Reported overheads range from 9% to 105% [1, 2]. In practice, it has been observed to be as much as an order of magnitude slower for some applications.
• The profile data must be collected on a specially compiled executable. This dual-compilation model is not convenient. Indeed, for applications with long build times, doubling this time may degrade productivity.
• Only a small set of information can be collected. For example, it is not possible with this approach to collect information about memory or branch-prediction issues.
• There is a tight coupling between the two builds. It is generally necessary to use the same optimization options in both the first and second compilations. Otherwise, the control-flow graphs (CFG) of the two compilations may not match and the profiling data cannot be applied with enough accuracy. Making large changes in the source code also invalidates the previous profile data.
• The instrumentation instructions can alter the quality of the generated profile. As new instructions are inserted into the program, they may change the results of the profiling, which may, in turn, change the optimization decisions.
arXiv:1411.6361v1 [cs.PL] 24 Nov 2014
For these reasons, traditional PGO has not been widely adopted. Even most CPU-intensive projects do not use this technique. To avoid these drawbacks, we propose in this paper a solution based on sampling Hardware Performance Events generated by the Performance Monitoring Unit (PMU) instead of instrumenting the application. The source positions contained in the debug symbols are used to recreate an accurate profile.
Below, we list the primary contributions of this paper:
1. We study Hardware Performance Events and their use for PGO.
2. We build a complete toolchain, based on GCC, able to perform sampling-based PGO.
3. Finally, we evaluate the performance of our implementation. We present results obtained with our implementation in the GCC compiler on the SPEC 2006 benchmarks. We show that our toolchain can achieve 93% of the gains obtained using instrumentation-based PGO and incurs only a 1.06% overhead, whereas instrumentation adds 16% overhead on average.
The rest of this paper is organized as follows: Section 2 describes how to combine PGO and sampling Hardware Performance Events. Section 3 then presents the toolchain that has been developed. Section 4 describes the results obtained with sampling-based PGO. Section 5 lists related work in the area. Finally, Section 6 presents our conclusions and future work for this project.
2 PGO and Performance Counters
This section describes how sampling Hardware Performance Counters can be used to perform PGO.
2.1 Hardware Performance Events
Every modern microprocessor includes a set of counters that are updated when a particular hardware event occurs. This set of counters is managed by the Performance Monitoring Unit (PMU) of the CPU. These events can then be accessed by another application.
The most common way of using these counters is by sampling. The PMU is configured to generate an interrupt when a particular counter goes over a specific limit (it overflows). At this point, the monitoring software can record the state of the system and especially the instruction currently being executed, indicated by the Program Counter (PC). This directly generates a complete instruction profile for the binary instructions.
A basic block profile can be naturally estimated from sampling. Each time the counter overflows, the instruction identified by the PC is saved. At the end of the execution, the sample counts of the instructions of each basic block are summed. The basic block sums must be normalized to avoid giving higher weight to larger basic blocks.
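The summation and normalization step can be sketched as follows. This is only an illustration, not the code of the toolchain: the function name `basic_block_profile`, the dictionary inputs, and the choice of dividing each sum by the number of instructions in the block are assumptions made for the example.

```python
def basic_block_profile(samples, blocks):
    """Estimate a per-block weight from PC samples.

    samples: dict mapping instruction address -> number of samples (hypothetical input)
    blocks:  dict mapping block name -> list of instruction addresses in that block
    """
    profile = {}
    for name, addresses in blocks.items():
        total = sum(samples.get(addr, 0) for addr in addresses)
        # Assumed normalization: divide by the instruction count so that
        # larger blocks do not receive a higher weight.
        profile[name] = total / len(addresses) if addresses else 0.0
    return profile


# Hypothetical data: a three-instruction block and a one-instruction block.
samples = {0x10: 4, 0x14: 6, 0x18: 5, 0x20: 9}
blocks = {"bb_a": [0x10, 0x14, 0x18], "bb_b": [0x20]}
profile = basic_block_profile(samples, blocks)
```

Without the division, `bb_a` would appear hotter than `bb_b` (15 samples against 9) only because it contains more instructions.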
There are many different events, from clock cycles to L2 cache misses or branch mispredictions. The available events depend on the microarchitecture. Some processors provide a very large list of events (more than 1,500 for the PowerPC Power7 family) and some far fewer (69 for the ARM V7 Cortex-A9 processor).
The solution presented in this paper has been specially tuned for Intel® Core™ i7 events [10].
2.2 Sampling-Based PGO
Combining PGO and Hardware Performance Counters results in the sampling-based PGO technique. In this model, the program that is profiled is the production binary itself; there is no need for a special executable. However, the profile is this time generated by a specific program that can sample the values of the Hardware Performance Counters, namely a profiler. Once a profile is generated, the compiler can use it for subsequent compilations.
The main advantage of this approach is the much smaller overhead of sampling compared to instrumentation. The program is only interrupted when a counter overflows, not every time a function is executed, for instance. The cost of sampling depends on the sampling period and on the event that is sampled. Moreover, the profiler can be attached to an already running program. This means that the production executable can be profiled for some hours without interrupting it. Profiling the production binary, with the production input data, generally results in more accurate profiles. Moreover, it is not necessary to find training data for the instrumented executable, a task which can be hard depending on the profiled executable.
Since source position is used to match the program and the profile, the coupling between the profile and the binary is much weaker. Changing the compilation options does not invalidate the profile. Moreover, older profiles can still be used with new versions of the application. It is not necessary to generate profiles for each version of the program, except in the case of major changes in the application.
The data generated by the profiler must be transformed before being used for PGO. The samples contain only the addresses of the instructions and the sample counts. Information such as the source location of each instruction is necessary to turn this raw data into a real instruction profile that is useful for the compiler. This is explained in detail in Section 3.4.
On the other hand, there are also some drawbacks to this method:
• As the supported events depend on the microarchitecture, sampling very specific events may not be portable.
• As not all events are recorded, this method is not as accurate as instrumentation-based profiling. In practice, it has proved accurate enough to bring performance speedups.
• The sampling period must be chosen carefully. Indeed, a shorter sampling period means a larger overhead, but a more accurate profile. It is important to find the correct balance between the two factors. Beyond a certain point, collecting more samples brings little additional benefit.
• The sampling period may become synchronized with some part of the code, which can lead to the point where only one instruction of a loop is reported [5]. This problem can be solved by randomizing the period.
Even with these drawbacks, sampling-based PGO
is a better fit in the development process than the
traditional approach.
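The synchronization drawback, and the effect of randomizing the period, can be illustrated with a small simulation. It is a toy model under stated assumptions, not a description of how the PMU or perf behaves: the loop body is assumed to consist of `loop_len` instructions that each generate exactly one event, and `sample_loop` and its parameters are hypothetical names.

```python
import random
from collections import Counter


def sample_loop(loop_len, period, total_events, jitter=0, seed=0):
    """Simulate sampling a loop whose instructions each produce one event.

    Returns a Counter mapping the index of the sampled instruction inside
    the loop body to the number of samples attributed to it.
    """
    rng = random.Random(seed)
    hits = Counter()
    position = 0
    while True:
        # With jitter > 0 the period is randomized around its nominal value.
        step = period + (rng.randint(-jitter, jitter) if jitter else 0)
        position += step
        if position >= total_events:
            break
        hits[position % loop_len] += 1
    return hits


# The nominal period (1000) is a multiple of the loop length (8).
fixed = sample_loop(loop_len=8, period=1000, total_events=1_000_000)
randomized = sample_loop(loop_len=8, period=1000, total_events=1_000_000, jitter=16)
```

With the fixed period every sample lands on the same instruction of the loop, whereas the randomized period spreads the samples over several instructions.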
3 Implementation
Our implementation relies on the following components:
1. perf [7]: The Hardware Performance Counters profiler of the Linux kernel.
2. Gooda¹ [3]: An analyzer for perf data files. It is the result of a collaboration between the Lawrence Berkeley National Laboratory (LBNL) and Google.
3. AutoFDO² (AFDO): A patch from Google bringing sampling-based PGO to GCC. See Section 3.1.
In this implementation, PGO is performed in several steps (the first three are automated into a single step for the user by the profile generator):
1. The application is profiled with perf. A certain set of events is collected.
2. Gooda generates intermediate spreadsheets by analyzing the perf data files.
3. The spreadsheets are converted into an AFDO profile by our implementation.
4. GCC reads the profile with the AFDO patch and optimizes the code.
Figure 1: Profile process
Figure 1 shows an overview of these steps and the intermediate files that are used in the process.
The profile used by this toolchain is an instruction profile: there is a counter value for every instruction of the application. This profile does not contain basic blocks. The basic block profile is computed from the instruction profile inside GCC by AFDO. The profile also includes the entry count and execution count for each function.
Two modes have been implemented (Cycles Counting and LBR); see Section 3.2 and Section 3.3. In both modes, an instruction profile is generated with a counter for each instruction.
Since Gooda is able to merge several spreadsheets, it is possible, for instance, to collect profiles on several machines of a cluster and then combine all of them into an accurate global profile. This can also be used to merge the profiles obtained when the same executable is run with different data sets.
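The principle of merging can be sketched on a simplified in-memory representation. This does not reproduce Gooda's spreadsheet format or its merge logic; it assumes, for illustration only, that each profile is a mapping from an instruction key to a sample count and that merging means summing the counts.

```python
from collections import Counter


def merge_profiles(profiles):
    """Sum per-instruction sample counts over several profiles."""
    merged = Counter()
    for profile in profiles:
        # Counter.update adds counts instead of overwriting them.
        merged.update(profile)
    return merged


# Hypothetical profiles collected on two machines running the same binary.
machine_a = {"foo.c:12": 40, "foo.c:13": 5}
machine_b = {"foo.c:12": 60, "bar.c:7": 3}
merged = merge_profiles([machine_a, machine_b])
```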
3.1 AutoFDO
AutoFDO (AFDO) [4] is a patch for GCC developed by Google. AFDO is the successor of SampleFDO [5]. It has been rewritten from scratch and has several major differences. SampleFDO reconstructed the CFG for each function using a Minimum Cost Flow (MCF) algorithm. It also used a complex two-phase annotation pass to handle inlined functions. These two techniques were very expensive and complicated, and have therefore been abandoned in AFDO.
AFDO uses source profile information. The profile maps each instruction to an inline stack, itself mapped to runtime information. An inline stack is a stack of functions that have been inlined at a specific call site. The execution count and the number of instructions mapped to each source instruction are collected at runtime. Debug information is used to reconstruct the source profile from the runtime profile. While AFDO has been developed for perf, it is independent of it. The profile could be generated by another hardware event profiler.
AFDO is activated using a command-line switch. A special pass reads the profile and loads it into custom data structures. The first special use of the profile data is made during the early inline pass. To make the profile annotation easier, AFDO ensures that each hot call site that was inlined in the profiled binary is also inlined during this pass. For this, a threshold-based top-down decision is used during a bottom-up traversal of the call graph.
Once early inlining decisions have been made, the execution profile is annotated using the AFDO profile. The basic block counts are taken directly from the profile, whereas the edge counts are propagated from the execution counts. The strength of AFDO is that it already benefits from all the GCC optimizations that use the profile. When PGO is used, the static profile used in optimization passes is replaced with a profile generated from the AFDO profile. All backend optimization passes use the profile information as usual.
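One simple way to derive edge counts from basic block counts is flow conservation: the counts of the edges entering (or leaving) a block add up to the count of the block. The sketch below applies this rule repeatedly. It is an illustrative assumption, not AFDO's actual propagation code, and the names `propagate_edge_counts`, `block_counts` and `edges` are hypothetical.

```python
def propagate_edge_counts(block_counts, edges):
    """Infer edge counts from block counts using flow conservation.

    block_counts: dict mapping block name -> execution count
    edges:        list of (source_block, destination_block) pairs
    Returns a dict mapping each edge that could be inferred to its count.
    """
    known = {}
    changed = True
    while changed:
        changed = False
        for block, count in block_counts.items():
            for group in (
                [e for e in edges if e[0] == block],  # outgoing edges
                [e for e in edges if e[1] == block],  # incoming edges
            ):
                unknown = [e for e in group if e not in known]
                if len(unknown) == 1:
                    # The single unknown edge carries the remaining flow.
                    rest = sum(known[e] for e in group if e in known)
                    known[unknown[0]] = max(count - rest, 0)
                    changed = True
    return known


# Hypothetical diamond-shaped CFG: entry branches to then/else, both reach exit.
edges = [("entry", "then"), ("entry", "else"), ("then", "exit"), ("else", "exit")]
block_counts = {"entry": 100, "then": 30, "else": 70, "exit": 100}
edge_counts = propagate_edge_counts(block_counts, edges)
```

Sampled block counts are rarely perfectly consistent, so a real implementation has to tolerate sums that do not match; here the remaining flow is simply clamped at zero.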
However, some special tuning still needs to be done. For instance, during Integrated Register Allocation (IRA), a function is considered never to be called if the count of its entry basic block is zero. In sampling-based PGO, however, this is not necessarily the case, and special care is taken to ensure that the function is not considered dead. The same problem arises in the hot call site heuristic, where the entry count of the callee function is checked for zero. In this case, the heuristic is disabled if the count is zero.
AFDO especially targets C/C++ applications and therefore specially tunes callgraph optimizations.
3.2 Cycles Counting
In this first mode, the counter that is used is the number of Unhalted Core Cycles for each instruction. It is a common measure of performance, as it only counts the time when the processor is doing something, not when it is waiting for another operation, I/O for instance.
This mode is based on the common Cycle Accounting Analysis for Intel processors [9].
In this mode, the instruction profile is naturally generated, as events are directly mapped to an instruction.
3.3 LBR Mode
To improve accuracy, the second mode uses the Last Branch Record (LBR) feature of the PMU.
The LBR is a branch trace buffer. It captures the source and target addresses of each retired taken branch. It can track 16 pairs of addresses. It provides a call tree context for every event for which LBR is enabled.
In this implementation, we do not directly manipulate the LBR data. We take advantage of the fact that Gooda already merges the LBR samples together to compute the number of basic block executions.
The counter used in LBR mode is the number of Branch Instruction Retired³ events. It is an interesting event because it makes it easy to compute the number of executions of each basic block, as it references each branch. By using the 16 address pairs of the LBR history, the basic block paths can be computed with high accuracy.
The instruction profile is generated from the basic block profile, every instruction of a basic block having the same counter value.
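The idea of deriving basic block executions from LBR records can be sketched as follows. Between the target of one taken branch and the source of the next one, no branch was taken, so the code in that address range ran sequentially. This is a simplified illustration and not Gooda's algorithm, which is not detailed here; the function name, the input layout and the addresses are assumptions.

```python
from collections import Counter


def count_blocks_from_lbr(lbr_stacks, block_starts):
    """Count basic block executions implied by LBR records.

    lbr_stacks:   list of stacks, each a list of (source, target) address
                  pairs ordered from the oldest to the newest taken branch
    block_starts: list of basic block start addresses
    """
    counts = Counter()
    for stack in lbr_stacks:
        for (_, target), (next_source, _) in zip(stack, stack[1:]):
            # No branch was taken between `target` and `next_source`, so
            # every block starting in that range executed exactly once.
            for start in block_starts:
                if target <= start <= next_source:
                    counts[start] += 1
    return counts


# Hypothetical loop: blocks at 0x110 and 0x120, back edge from 0x12C to 0x110.
block_starts = [0x100, 0x110, 0x120, 0x130]
stack = [(0x12C, 0x110), (0x12C, 0x110), (0x12C, 0x110)]
counts = count_blocks_from_lbr([stack], block_starts)
```

A stack of 16 pairs yields 15 such straight-line ranges per sample, which is why a single LBR sample carries much more path information than a single PC sample.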
3.4 Gathering the profile
The profile generated by perf and preprocessed by Gooda is a binary instruction profile. This means that each instruction is identified by its address. This address is not useful inside GCC because, during the optimization process, instructions have not yet been assigned an address. To identify an instruction inside GCC, it is necessary to have its source position.
Each binary instruction address must be mapped to its source location. For that, the debug symbols are extracted from the ELF executable and then used to reconstruct the source profile.
Each instruction is identified by four different values: its filename, the name of the containing function, its line number and its discriminator.
The DWARF discriminator makes it possible to distinguish different statements on the same line. This situation is very common in modern programming languages. Discriminators are gathered from the executable using addr2line for each instruction.
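As an illustration, the four-value key could be built from a location string as sketched below. The sketch assumes that the location has the textual form `file:line (discriminator N)`, which is how recent binutils versions of addr2line print it; the function names `parse_location` and `instruction_key` are hypothetical and this is not the converter's actual code.

```python
import re

# Assumed textual form: "<file>:<line>" optionally followed by
# " (discriminator <N>)"; unknown lines are printed as "?".
_LOCATION = re.compile(
    r"^(?P<file>.*):(?P<line>\d+|\?)(?: \(discriminator (?P<disc>\d+)\))?$"
)


def parse_location(text):
    """Split a location string into (filename, line, discriminator)."""
    match = _LOCATION.match(text.strip())
    if match is None:
        raise ValueError(f"unrecognized location: {text!r}")
    line = match.group("line")
    return (
        match.group("file"),
        int(line) if line != "?" else 0,
        int(match.group("disc") or 0),  # a missing discriminator means 0
    )


def instruction_key(function, location_text):
    """Build the (filename, function, line, discriminator) key."""
    filename, line, discriminator = parse_location(location_text)
    return (filename, function, line, discriminator)


key = instruction_key("compute", "/src/foo.c:42 (discriminator 3)")
```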
Another important point when distinguishing instructions is the handling of inlined functions. When a function gets inlined, copies of the instructions of the called function exist at several places in the binary. All those copies have the same source location in the debug symbols, so the source location alone is not enough to distinguish them. Fortunately, DWARF symbols include the inline stack for each inlined function. By storing the inline stack of each instruction, the profile is very accurate and provides a correct mapping between binary instructions and source instructions. These inline stacks are then processed directly by AFDO. This processing proved very important on large C++ projects.
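The role of the inline stack as part of the profile key can be sketched as follows. The representation is an assumption made for illustration: an inline stack is modelled as a tuple of (caller function, call site line) frames, and the AFDO profile format itself is not reproduced.

```python
from collections import Counter


def aggregate_by_inline_stack(samples):
    """Sum sample counts per (inline stack, source line) key.

    samples: iterable of (inline_stack, line, count) where inline_stack is a
             tuple of (caller_function, call_site_line) frames, outermost first
    """
    profile = Counter()
    for inline_stack, line, count in samples:
        profile[(tuple(inline_stack), line)] += count
    return profile


# Hypothetical samples for line 3 of a helper inlined at two call sites.
samples = [
    ((("main", 10),), 3, 120),  # copy inlined at main, line 10
    ((("main", 25),), 3, 5),    # copy inlined at main, line 25
    ((("main", 10),), 3, 30),
]
profile = aggregate_by_inline_stack(samples)
```

Without the inline stack in the key, the hot copy and the cold copy of the same source line would be merged into a single count.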
³ An instruction is retired when it has been executed and its result is used.
The last point of importance concerning the creation of the profile is the function name. The name of a function is generally not enough to identify it uniquely. When using GCC, a function has two more names: the mangled (or assembler) name, which uniquely identifies a function (it takes all the parameters into account), and the Binary File Descriptor (BFD) name, used by GCC in the debug symbols to identify functions in inline stacks. To identify each source function, its assembler name is taken from the symbol table of the ELF file.
3.5 Shortcomings
The present implementation has some limitations. First of all, the debug symbols are absolutely essential in order to reconstruct the correct instruction profile. Indeed, without debug symbols, it would not be possible to match assembly instructions to their original source locations. Our implementation works only for programs that have been compiled with debug symbols, although optimizations can still be enabled. Debug executables are not slower than stripped executables, but they can be much bigger. This can be overcome by keeping a stripped copy of the optimized executable and using it in production after PGO. In that case, however, the production binary cannot be used for profiling.
Moreover, if the compiler does not generate precise debug symbols, the profile will not be accurate. This is especially a problem as some optimizations do not preserve the debug symbols correctly. Another problem comes in the form of over- or under-sampling. For instance, Loop Unrolling may cause the same statements to be duplicated in different basic blocks. Normalization may then lead to a profile that is too low for the generated basic blocks.
The implementation also depends heavily on addr2line and objdump to gather information about the binary file. To have all the features, a very recent version of binutils is necessary, at least 2.23.1. Moreover, if there is a bug in either of these tools, the generated profile could be inaccurate.
Another shortcoming comes from the fact that the DWARF debugging data format does not support discriminators in inline stacks. This means that the profile is not completely precise and can lead the compiler to the wrong decision, although this has not been found to be an issue in practice.
4 Experimental Results
The implementation has been evaluated in terms of speedups compared to the optimized version and to the instrumentation-based PGO version. All binaries were produced using a trunk version of the Google GCC 4.7 branch. The target was an x86_64 architecture. The processor used for the tests was an Intel® Xeon™ E5-2650, 2 GHz.
Four versions are compared:
• base: The optimized executable, compiled with -O2 -march=native. The same flags are also used for the other executables, with additional flags.
• instr: The executable trained with instrumentation-based PGO.
• ucc: The program trained with our implementation in Cycle Accounting mode. UNHALTED_CORE_CYCLES is sampled with a period of 2’000’000 events.
• lbr: The program trained with our implementation in LBR mode. BRANCH_INST_RETIRED is sampled with a period of 400’000 events.
All the results were collected on the SPEC CPU 2006 V1.2 benchmarks⁴. Each version of the binary is run three times and SPEC chooses the best result out of the three runs. The variation between the runs is very low.
For run time reasons, a subset of the benchmarks has been selected. This subset has been chosen to be representative of different programming languages, of both floating-point and integer programs, and of different fields of science.
For these tests, a modified Gooda version was used to work on small programs. Gooda is designed for big programs, so the profile is not generated for functions below some thresholds. It has been necessary to change the thresholds in order to include as many functions as possible in the profile.
Figure 2 shows the performance of the different PGO versions. Each version is compared to the performance of the base version.
The results are very interesting. Indeed, LBR achieves about 75% of the instrumentation-based PGO gains (arithmetic average of percentages). Cycle Accounting is less effective, but still achieves 53% of the gains. This was expected, as LBR should be more accurate than Unhalted Core Cycles.
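The "arithmetic average of percentages" can be read as follows: for each benchmark, the sampling-based gain is expressed as a percentage of the instrumentation-based gain, and these percentages are then averaged. The sketch below uses made-up numbers, not the measured results, and assumes that every instrumentation gain is non-zero; how the evaluation treats other cases is not stated.

```python
def share_of_gains(sampling_speedups, instr_speedups):
    """Average, over benchmarks, of the sampling gain as a percentage of the
    instrumentation gain (both given as speedups over base, in percent)."""
    ratios = [100.0 * s / i for s, i in zip(sampling_speedups, instr_speedups)]
    return sum(ratios) / len(ratios)


# Hypothetical speedups for three benchmarks.
lbr_speedups = [4.0, 3.0, 6.0]
instr_speedups = [5.0, 4.0, 6.0]
share = share_of_gains(lbr_speedups, instr_speedups)  # mean of 80, 75 and 100
```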
In some cases, sampling-based PGO even outperforms the traditional approach. For instance, on the astar benchmarks, LBR achieves 116% to 129% of the instrumentation gains. This is even more true for calculix and xalancbmk, where instrumentation-based PGO performs very poorly and sampling achieves good results. This difference is not so surprising, as several optimizations are driven by threshold-based heuristics, so small differences in the profile can drastically change decisions and lead to better (or worse) results.
On the contrary, there are also benchmarks where our implementation performs poorly compared to traditional PGO. For instance, bwaves proved a very bad case for our implementation. This comes from the fact that AutoFDO is not optimized for Fortran code, since it has been developed with a C/C++ centric approach. Indeed, special care has been taken to tune inlining and call optimization passes, while tuning for loop-intensive code has not been performed.
Figure 2: Speedups (in %) for SPEC CPU 2006 benchmarks. The application is trained with the training data set. Our implementation achieves 75% of instrumented PGO.
The first results were obtained by training the executable on the training data set. To see the impact of the input data set, the executables were then trained again on the reference data set. This means that the same input data set is used for training and for benchmarking the executable. Figure 3 shows the speedups obtained when training on the reference data set.
This time, LBR is able to achieve 84% of the instrumentation gains. An interesting point about these results is that, where instr and ucc improve their scores by about 22%, lbr improves by 37%. LBR sampling seems to be even more interesting when the input data is closer to the real data set. Most of the benchmarks improved only slightly with the reference data set, but xalancbmk improved by an order of magnitude. It seems that, in its case, the training data set is not close enough to the reference data set for PGO. calculix seems to be in the same situation, although the difference is not as spectacular.
As AFDO has been especially tuned for C++, Figure 3b presents the results for C++ benchmarks only.
On C++ benchmarks, both sampling versions perform very well. Cycle Accounting achieves 67% of the instrumentation gains and LBR reaches 93%. These results are very promising. For each benchmark, lbr is very close to (and sometimes better than) instr.
This section presented some results that can still be improved, especially for some non-C++ benchmarks. Once AFDO has improved support for other languages, like Fortran, it may be interesting to run these tests again to see how much of the instrumentation gains sampling-based PGO can reach. It may also be interesting to investigate the benchmarks where lbr proved better than instr and see if the result can be obtained on other benchmarks as well.
4.1 Sampling Overhead
The overhead of our implementation has also been evaluated and compared to the overhead of instrumentation-based PGO.
Four versions are compared:
• base: The optimized executable, compiled with -O2 -march=native.
• instr: The GCOV instrumented executable.
• ucc: The program run under the monitoring of the profiler, with sampling on UNHALTED_CORE_CYCLES with a period of 2’000’000 events.
• lbr: The program run under the monitoring of the profiler, with sampling on BRANCH_INST_RETIRED with a period of 400’000 events.
The sampling periods used for this test have been chosen by empirical evaluation, choosing the best one based on the speedup result and the overhead.
Figure 4 shows the overhead of the different versions, compared to the base version.
The overhead of instrumentation is high, but not as high as expected: only 16% on average. The highest overhead of instrumentation was 53%, on the povray benchmark.
As expected, the overhead of sampling is lower than the overhead of instrumentation. LBR has an average overhead of 1.06%, which is 15 times lower than instrumentation. In several benchmarks, the overhead is less than 1%.
Unfortunately, the overhead of Cycle Counting is much higher than it should be: it is as high as 10% on average. The problem is that, in this mode, Gooda profiles a large number of events, not only Unhalted Core Cycles. This adds a large performance overhead. At this time, it is not possible to configure Gooda to use only Unhalted Core Cycles. On the other hand, as the LBR event proved more accurate and has a small overhead, the value of this mode is reduced.
Figure 3: Speedups (in %) for SPEC CPU 2006 benchmarks: (a) non-C++ programs, (b) C++ programs. The application is trained with the reference data set. Our implementation achieves 84% of instrumented PGO on overall benchmarks and 93% on C++ programs only.
Figure 4: Overheads (in %) for SPEC CPU 2006 benchmarks. Our implementation has 15 times less overhead than instrumentation-based PGO.
Another point of interest is that the variability of the overhead is much higher for instrumentation than for sampling. Indeed, the overhead of instrumentation varies from 0% to 53%, whereas the overhead of LBR sampling varies only from 0.3% to 2%. The instrumentation overhead highly depends on the code that is being benchmarked. On the other hand, sampling overhead depends mostly on the sampling period and the type of the sampled events.
Some instrumented executables proved faster than their optimized counterparts. This may happen in some cases. After the instrumentation instructions are inserted in the program, the program is optimized further. These new instructions may change the decisions that are made during the optimization process. Such changes in decisions may lead to faster code. This situation cannot happen with sampling.
4.2 Tools Overhead
In the previous section, only the overhead of sampling (with perf) has been measured. The time necessary to generate the profile from the perf data also needs to be taken into account.
There are two different tools adding overhead to the overall process. The first one, which is also the slowest one, is Gooda. Until now, Gooda has not been tuned for performance and it may be quite slow when handling very large profiles. The conversion from Gooda spreadsheets to an AFDO profile is not so critical, since Gooda already filters out several functions. Moreover, it has already been tuned for performance so as to add the smallest possible overhead to the process.
Both overheads have been tested on several profiles gathered using perf in cycle accounting mode (UNHALTED_CORE_CYCLES with a period of 2’000’000 events). The profiles have been gathered on GCC compiling two different programs, a toy compiler and the converter itself. The perf profiles range from 194 KiB (eddic-list) to 167 MiB (gcc-google-eddic). Each test has been run five times and the best result has been taken. The variation between different runs was very low.
Figure 5a presents the overhead of the converter. As shown in this figure, the overhead of the converter is not negligible. It takes a maximum of six seconds for the test cases. This has to be weighed against the running time of the profiling. For instance, the test case gcc-eddic runs for 40 minutes. This makes an overhead of 0.25%, which is acceptable. An important point to consider is that the overhead does not scale directly with the size of the profile but with the number of functions reported by Gooda, which should grow up to a maximum related to the size of the profiled executable. In the converter, about 65% of the time is spent calling addr2line. This could be improved by integrating address-to-line conversion directly into the converter.
Figure 5: Overhead (in %) of the profile generation: (a) converter overhead, (b) Gooda overhead.
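The 0.25% figure quoted above follows directly from the two durations given in the text, six seconds of conversion for a 40-minute profiling run. The helper name `relative_overhead` is only used for this worked example.

```python
def relative_overhead(tool_seconds, run_seconds):
    """Tool time expressed as a percentage of the profiled run time."""
    return 100.0 * tool_seconds / run_seconds


# Six seconds of conversion for a 40-minute (2400 s) profiling run.
converter_overhead = relative_overhead(6, 40 * 60)
```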
Figure 5b shows the overhead of Gooda when converting the perf profile to a set of spreadsheets. It is very clear that the overhead of Gooda is much higher than the overhead of the converter and is considerable for at least two test cases (both gcc test cases). On the slowest test case (gcc-eddic), the overhead is as high as five percent, which makes the toolchain much less interesting. It also takes several seconds for the other samples, even though they are much faster to run under profiling. The relative overhead generally decreases as the running time of the program increases. In its current state, the toolchain is better suited to long-running programs.
The long running time of Gooda is something that should really be improved in the future.
5 Related Work
In 2008, Roy Levin, Ilan Newman and Gadi Haber [8]
proposed a solution to generate edge profiles from
instruction profiles of the instruction-retired hardware
event for FDPR-Pro, the IBM post-link-time optimizer.
This solution works at the binary level: the profile is
applied to the corresponding basic blocks after link
time. Constructing the edge profile from the sample
profile is formulated as a Minimum Cost Circulation
problem. They showed that it can be solved in acceptable
time for the SPEC benchmarks, but the algorithm remains
expensive.
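To illustrate what constructing an edge profile from block counts means, here is a minimal sketch that uses plain flow conservation. It is not the Minimum Cost Circulation formulation of [8]: it only works when the block counts are exact and consistent. The control-flow graph, the counts and the function name are hypothetical.

```python
def infer_edge_counts(block_counts, edges):
    """Infer edge counts from exact block counts using flow conservation.

    For every block, the counts of its outgoing edges (and of its incoming
    edges) sum to the block count. Whenever exactly one edge of such a group
    is unknown, it can be computed from the others.
    """
    known = {}
    changed = True
    while changed:
        changed = False
        for block, count in block_counts.items():
            outgoing = [e for e in edges if e[0] == block]
            incoming = [e for e in edges if e[1] == block]
            for group in (outgoing, incoming):
                unknown = [e for e in group if e not in known]
                if len(unknown) == 1:
                    solved = sum(known[e] for e in group if e in known)
                    known[unknown[0]] = count - solved
                    changed = True
    return known


# Hypothetical diamond-shaped CFG: entry branches to a or b, both reach exit.
blocks = {"entry": 100, "a": 70, "b": 30, "exit": 100}
cfg_edges = [("entry", "a"), ("entry", "b"), ("a", "exit"), ("b", "exit")]
print(infer_edge_counts(blocks, cfg_edges))
```

With sampled counts the conservation equations generally do not hold exactly, and this naive propagation can produce inconsistent or negative values. That is the situation the circulation formulation is meant to handle, by looking for the cheapest adjustment that makes the profile consistent.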
Soon after Levin et al., Vinodha Ramasamy, Robert
Hundt, Dehao Chen and Wenguang Chen [12] presented
another solution that uses instruction-retired hardware
events to construct an edge profile. This solution was
implemented and tested in the Open64 compiler. Unlike
the previous work, the profile is reconstructed from the
binary using source position information. This has the
advantage that the binary can be built using any
compiler and then used by Open64 to perform PGO. They
reached an average of 80% of the gains that can be
obtained with instrumentation-based PGO.
In 2010, Dehao Chen et al. [5] continued the work
started in Open64 and adapted it for GCC. In this work,
several optimizations of GCC were specially adapted to
the use of sampling profiles. The basic block and edge
frequencies are derived using a Minimum Cost Flow
algorithm. In this solution, the Last Branch Record
(LBR) precise sampling feature of the processor was used
to improve the accuracy of the profile. Moreover, they
also used a special version of the Lightweight
Interprocedural Optimizer (LIPO) of GCC. The value
profile is also derived from the sample profile, using
PEBS mode. With all these optimizations put together,
they were able to achieve an average of 92% of the
performance gains of instrumentation-based Feedback
Directed Optimization (FDO).
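The accuracy benefit of LBR sampling can be sketched as follows. Each LBR entry records the source and target address of a taken branch, so the code between one branch target and the next branch source must have executed sequentially. A single sample therefore yields whole executed address ranges instead of one instruction address. The sketch below assumes the stack is given oldest branch first; the addresses and function names are hypothetical, and this is not necessarily how [5] processes the records.

```python
from collections import Counter


def executed_ranges(lbr_stack):
    """Return (start, end) address ranges executed between recorded branches.

    lbr_stack is a list of (source, target) pairs, oldest branch first.
    """
    ranges = []
    for (_, target), (next_source, _) in zip(lbr_stack, lbr_stack[1:]):
        # Straight-line code runs from a branch target to the next branch source.
        if target <= next_source:
            ranges.append((target, next_source))
    return ranges


def accumulate(samples):
    """Count how often each executed range appears across all sampled stacks."""
    counts = Counter()
    for stack in samples:
        counts.update(executed_ranges(stack))
    return counts


# Hypothetical sample with three taken branches.
sample = [(0x10, 0x40), (0x48, 0x20), (0x2C, 0x60)]
for start, end in executed_ranges(sample):
    print(f"executed {start:#x}..{end:#x}")
```

Every instruction inside a range receives the count of that range, which is why basic block counts derived this way are less sensitive to the skid of individual samples.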
More recently, Dehao Chen (Google) released AutoFDO
(AFDO; see footnote 5), on which our solution is based.
It is a patch for GCC to handle sampling-based profiles.
The profile is represented by a GCOV file containing
function profiles. Several optimizations of GCC have
been revised to handle this new kind of profile more
accurately. The profile is generated from the debug
information contained in the executable, and the samples
are collected using the perf tool. AutoFDO is especially
designed to support optimized binaries. For the time
being, AFDO does not handle value profiles. Only the GCC
patch has been released so far; no tool to generate the
profile has been released.
On the side of performance counters, Vincent M. Weaver
showed that, when the setup is correctly tuned, the
values of the performance counters vary very little
between runs (0.002 percent on the SPEC benchmarks).
Nonetheless, very subtle changes in the setup can result
in large variations in the results [14].
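One common way to quantify such run-to-run variation is the coefficient of variation, that is, the standard deviation divided by the mean. The sketch below assumes this is the metric of interest; the counter values are made up for illustration and the function name is ours.

```python
import statistics


def coefficient_of_variation_percent(counts):
    """Population standard deviation as a percentage of the mean."""
    mean = statistics.mean(counts)
    return 100.0 * statistics.pstdev(counts) / mean


# Hypothetical retired-instruction counts from repeated runs of one benchmark.
runs = [1_000_000_020, 1_000_000_000, 999_999_980]
print(f"variation: {coefficient_of_variation_percent(runs):.6f}%")
```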
Other sampling approaches that do not use performance
counters have been proposed. For instance, the Morph
system uses statistical sampling of the program counter
to collect profiles [15]. In another solution, kernel
instructions were used to sample the contents of the
branch-prediction unit of the hardware [6]. These two
solutions require that additional information be encoded
into the binary to correlate samples to the compiler's
Intermediate Representation.
Performance counters are also starting to be used in
areas other than Profile-Guided Optimization. For
instance, Schneider et al. sample performance counters
to optimize data locality in a VM with a garbage
collector [13]. In this solution, the collected data
were used to drive online optimizations in the
Just-In-Time (JIT) compiler of the VM.

Footnote 5: 09/msg01941.
6 Conclusion and Future Work
We designed and implemented a toolchain that uses
hardware event sampling to drive Profile-Guided
Optimization inside GCC. Our implementation proved to be
competitive with instrumentation-based PGO both in terms
of performance, achieving 93% of the gains of
traditional PGO on C++ benchmarks, and in terms of
speed, with 15 times less profiling overhead. The
experiments show that this technique is already usable
in production. However, its high performance is
currently limited to C++ programs. Some work is needed
to extend the current toolchain to support more
programming languages. For that, the most important
changes will need to be made in AFDO.
Instrumentation-based PGO still has an advantage over
our implementation: it can generate value profiles. This
kind of feature is not yet supported by our toolchain.
However, it has already been implemented with
sampling-based PGO in [5], and it was being developed in
AFDO during our project. It should eventually be
integrated into the toolchain itself.
The presented toolchain makes it easy to handle new
events. These events may lead to the implementation of
novel optimizations. Of course, sampling more events
also incurs more overhead during profiling. Experiments
have been made to integrate Load Latency events into
GCC. The problem is that the new information is hard to
use in existing or new optimization techniques. We
implemented a Loop Fusion pass for GCC that takes the
Load Latency into account in its decision heuristics
[11]. The main difficulty with Load Latency events is
that they are not accurate enough at the basic block
level. To make better use of these events, it would be
necessary to have an instruction-level profile inside
the compiler.
Some work will also have to be done to improve the
speed of Gooda for large profiles, which is currently
too slow.
7 Acknowledgments
We want to thank the reviewers for their comments
and corrections regarding this paper. We would like
to thank Stephane Eranian for his help in using the
perf profiler and for providing kernel patches for perf
events and libpfm.
A Implementation
The implementation of the profile generator is available
on GitHub (
The usage of the toolchain is described on the
repository home page.
References
[1] Thomas Ball and James R. Larus. Optimally profiling
and tracing programs. ACM Transactions on Programming
Languages and Systems, 16:59–70, 1994.
[2] Thomas Ball and James R. Larus. Efficient path
profiling. In Proceedings of the 29th Annual
International Symposium on Microarchitecture, pages
46–57, 1996.
[3] Paolo Calafiura, Stephane Eranian, David Levinthal,
Sami Kama, and Roberto Agostino Vitillo. Gooda: The
generic optimization data analyzer. Journal of Physics:
Conference Series, 396(5):052072, 2012.
[4] Dehao Chen, Neil Vachharajani, Robert Hundt,
Xinliang Li, Stephane Eranian, Wenguang Chen, and Weimin
Zheng. Taming hardware event samples for precise and
versatile feedback directed optimizations. IEEE
Transactions on Computers, 62(2):376–389, 2013.
[5] Dehao Chen, Neil Vachharajani, Robert Hundt,
Shih-wei Liao, Vinodha Ramasamy, Paul Yuan, Wenguang
Chen, and Weimin Zheng. Taming hardware event samples
for FDO compilation. In Proceedings of the 8th annual
IEEE/ACM international symposium on Code generation and
optimization, CGO '10, pages 42–52, New York, NY, USA,
2010. ACM.
[6] Thomas M. Conte, Burzin Patel, Kishore N. Menezes,
and J. Stan Cox. Hardware-based profiling: An effective
technique for profile-driven optimization, 1996.
[7] Arnaldo Carvalho de Melo. The new linux-perftools.
In Slides from Linux Kongress, 2010.
[8] Roy Levin, Ilan Newman, and Gadi Haber.
Complementing missing and inaccurate profiling using a
minimum cost circulation algorithm. In Proceedings of
the 3rd international conference on High performance
embedded architectures and compilers, HiPEAC'08, pages
291–304, Berlin, Heidelberg, 2008. Springer-Verlag.
[9] David Levinthal. Cycle Accounting Analysis on Intel®
Core™ 2 Processors. Technical report, Intel Corp., 2008.
[10] David Levinthal. Performance Analysis Guide for
Intel® Core™ i7 and Intel® Xeon™ 5500 processors.
Technical report, Intel Corp., 2009.
[11] Omitted. Omitted. Master's thesis, Omitted,
Omitted, 2013.
[12] Vinodha Ramasamy, Paul Yuan, Dehao Chen, and Robert
Hundt. Feedback-directed optimizations in GCC with
estimated edge profiles from hardware event sampling. In
Proceedings of GCC Summit 2008, pages 87–102, 2008.
[13] Florian T. Schneider, Mathias Payer, and Thomas R.
Gross. Online optimizations driven by hardware
performance monitoring. In Proceedings of the 2007 ACM
SIGPLAN conference on Programming language design and
implementation, PLDI '07, pages 373–382, New York, NY,
USA, 2007. ACM.
[14] Vincent M. Weaver and Sally A. McKee. Can hardware
performance counters be trusted? In IISWC, pages
141–150, 2008.
[15] Xiaolan Zhang, Zheng Wang, Nicholas Gloy, J. Bradley
Chen, and Michael D. Smith. System support for automatic
profiling and optimization. In Proceedings of the
sixteenth ACM symposium on Operating systems principles,
SOSP '97, pages 15–26, New York, NY, USA, 1997. ACM.