Hardware Counted Proﬁle-Guided Optimization
Baptiste Wicht (EIA-FR)
Roberto A. Vitillo
Dehao Chen (Google)
David Levinthal (Google)
Abstract
Profile-Guided Optimization (PGO) is an excellent means to improve the performance of a compiled program. Indeed, the execution path data it provides helps the compiler to generate better code.
At the time of this writing, compilers only support
instrumentation-based PGO. This proved eﬀective for
optimizing programs. However, few projects use it,
due to its complicated dual-compilation model and its
high overhead. Our solution of sampling Hardware Performance Counters overcomes these drawbacks. In
this paper, we propose a PGO solution for GCC by
sampling Last Branch Record (LBR) events and using
debug symbols to recreate source locations of binary instructions.
By using LBR sampling, the generated profiles are
very accurate. This solution achieved an average of
83% of the gains obtained with instrumentation-based
PGO and 93% on C++ benchmarks only. The proﬁl-
ing overhead is only 1.06% on average whereas instru-
mentation incurs a 16% overhead on average.
Keywords Proﬁle-Guided Optimization, Sam-
pling, Hardware Performance Counters, Compilers
1 Introduction
Profilers help developers and compilers find the main
areas for optimization. Proﬁling and optimizing by
hand is a time-consuming process, but this process
can be automated. Modern compilers include an op-
timization technique called Proﬁle-Guided Optimiza-
tion (PGO). Feedback-Directed Optimization (FDO)
is also used as a synonym of PGO.
PGO uses information collected during the execu-
tion of a program to optimize it. Generally, edge exe-
cution frequencies between basic blocks are collected.
Several optimization techniques can take advantage
from the collected proﬁle. For instance, the data can
be used to drive inlining decisions and block ordering
within a function to achieve minimal cacheline usage.
Branches can be reordered based on their frequency to
avoid branch misprediction. Loops working on arrays
causing Data Cache misses can be improved to make
better use of the cache. As the dynamic proﬁle, unlike
the static proﬁle, captures execution frequency, this
can result in impressive speedups for non-I/O-intensive programs.
Compilers currently support instrumentation-based
PGO. In this variant, the compiler must ﬁrst gener-
ate a special version of the application in which in-
structions are inserted at speciﬁc locations to generate
the proﬁle. During the execution, counters are incre-
mented by these instructions and ﬁnally, the proﬁle
is generated into a ﬁle. After that, the program is
compiled again, this time with PGO ﬂags, to use the
proﬁle and optimize the binary for the ﬁnal version.
This approach has several drawbacks:
•The instructions inserted into the program slow
it down. An instrumented binary can be much
slower than its optimized counterpart. This has
been reported to incur between 9% and 105% overhead [1, 2]. In practice, instrumented binaries have been observed to be as much as an order of magnitude slower.
•The proﬁle data must be collected on a specially
compiled executable. This dual-compilation
model is not convenient. Indeed, for applications
with long build times, doubling this time may deter its use.
•Only a small set of information can be collected.
For example, it is not possible with this approach
to collect information about memory accesses or branch-prediction behavior.
•There is a tight coupling between the two builds.
It is generally necessary to use the same optimization options in both the first and the second compilation. Otherwise, the control-flow graphs (CFG) of the two compilations may not match and the profiling data may not be usable with enough accuracy.
Making large changes in the source code also in-
validates the previous proﬁle data.
•The instrumentation instructions can alter the
quality of the generated proﬁle. As new instruc-
tions are inserted into the program, they may
change the results of the profiling, which may, in turn, change the optimization decisions.
For these reasons, traditional PGO has not been
widely adopted. Even most CPU-intensive projects do not use this technique. To avoid these
drawbacks, we propose in this paper a solution based
on sampling Hardware Performance Events generated
by the Performance Monitoring Unit (PMU) instead
of instrumenting the application. Source positions contained in the debug symbols are used to recreate an instruction profile.
Below, we list the primary contributions of this paper:
1. We study Hardware Performance Events and
their use for PGO.
2. We build a complete toolchain, based on GCC,
able to perform sampling-based PGO.
3. Finally, we evaluate the performance of our imple-
mentation. We present results obtained with our
implementation in the GCC compiler with SPEC
2006 benchmarks. We show that our toolchain
can achieve 93% of the gains obtained using
instrumentation-based PGO and incurs only a
1.06% overhead, where instrumentation adds 16%
overhead on average.
The rest of this paper is organized as follows: Sec-
tion 2 describes how to combine PGO and sam-
pling Hardware Performance Events. Section 3 then
presents the toolchain that has been developed. Sec-
tion 4 describes the results obtained with sampling-
based PGO. Section 5 lists related work in the area.
Finally, Section 6 presents our conclusions and future
work for this project.
2 PGO and Performance Counters
This section describes how sampling Hardware Perfor-
mance Counters can be used to perform PGO.
2.1 Hardware Performance Events
Every modern microprocessor includes a set of coun-
ters that are updated when a particular hardware
event arises. This set of counters is managed by the
Performance Monitoring Unit (PMU) of the CPU.
These events can then be accessed by another application.
The most common way of using these counters is by
sampling. The PMU has to be conﬁgured to gener-
ate an interrupt when a particular counter goes over
a specific limit (it overflows). At this point, the monitoring software can record the state of the system, in particular the currently executed instruction, indicated by the Program Counter (PC). This directly generates a complete profile of the binary instructions.
A basic block proﬁle can be naturally estimated
from sampling. Each time the counter overﬂows, the
instruction identiﬁed by the PC is saved. At the end
of the execution, the number of samples for each in-
struction of a basic block are summed. The basic block
sums must be normalized to avoid giving higher weight
to larger basic blocks.
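The estimation described above can be sketched as follows; the block layout and sample addresses are invented for illustration:

```python
# Estimate a basic-block profile from PC samples (sketch).
# Each sample is the address of the instruction that was executing
# when the hardware counter overflowed.
from collections import Counter

def block_profile(samples, blocks):
    """samples: iterable of sampled instruction addresses.
    blocks: dict mapping block name -> list of instruction addresses."""
    hits = Counter(samples)
    profile = {}
    for name, insns in blocks.items():
        total = sum(hits[addr] for addr in insns)
        # Normalize by block size so larger blocks do not get inflated weights.
        profile[name] = total / len(insns)
    return profile

blocks = {"bb0": [0x400000, 0x400004], "bb1": [0x400010]}
samples = [0x400000, 0x400004, 0x400004, 0x400010]
print(block_profile(samples, blocks))  # {'bb0': 1.5, 'bb1': 1.0}
```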
There are many different events, from clock cycles to L2 cache misses and branch mispredictions. The available events depend on the microarchitecture. Some processors provide a very large list of events (more than 1,500 for the PowerPC Power7 family) and some much fewer (69 for the ARM v7 Cortex-A9 processor).
The solution presented in this paper has been specially tuned for Intel® Core™ i7 events.
2.2 Sampling-Based PGO
Combining PGO and Hardware Performance Counters results in the sampling-based PGO technique. In this model, the profiled program is directly the production binary; there is no need for a special executable. Instead, the profile is generated by a dedicated program that samples the values of the Hardware Performance Counters, namely a profiler. Once a profile is generated, the compiler can use it for subsequent compilations.
The main advantage of this approach is the much
smaller overhead of sampling compared to instrumen-
tation. The program is only interrupted when a
counter overﬂows, not every time a function is exe-
cuted for instance. The cost of sampling depends on
the sampling period and on the event that is sampled.
Moreover, the profiler can be attached to an already running program. This means that the production executable can be profiled for some hours without interrupting it. Profiling the production binary, with the production input data, generally results in more accurate profiles. Moreover, it is not necessary to find training data for the instrumented executable, a task which can be hard depending on the profiled application.
Since source position is used to match the program
and the proﬁle, the coupling between the proﬁle and
the binary is much smaller. Changing the compilation
options would not invalidate the proﬁle. Moreover,
older proﬁles can still be used in new versions of the
application. It is not necessary to generate proﬁles
for each version of the program, except in the case of
major changes in the application.
The data generated by the proﬁler must be trans-
formed before being used for PGO. The samples con-
tain only the address of the instructions and the sam-
ple count. Information such as the source location of
the instruction is necessary to generate a real instruc-
tion proﬁle useful for the compiler from this raw data.
This is explained in detail in Section 3.4.
On the other hand, there are also some drawbacks
to this method:
•As the supported events depend on the microarchitecture, sampling very specific events may not be portable.
•As not all events are recorded, this method is not as accurate as instrumentation-based profiling. In practice, it proved accurate enough to bring performance speedups.
•The sampling period must be chosen carefully. Indeed, a shorter sampling period means a larger overhead, but a more accurate profile. It is important to find the correct balance between the two factors. Beyond a certain point, there is little benefit in sampling more events.
•The sampling period may happen to be synchronized with some part of the code, to the point where only one instruction of a loop is ever reported. This problem can be solved by randomizing the period.
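The randomization can be sketched as follows; the base period and jitter are illustrative values, not the ones used by any particular profiler:

```python
# Sketch of period randomization, to avoid sampling in lockstep with a
# loop. A real PMU reloads its counter with the chosen period; here we
# just draw the next reload value uniformly around a nominal period.
import random

BASE_PERIOD = 400_000   # nominal events between two samples
JITTER = 0.10           # +/-10% randomization

def next_period(rng):
    lo = int(BASE_PERIOD * (1 - JITTER))
    hi = int(BASE_PERIOD * (1 + JITTER))
    return rng.randint(lo, hi)

rng = random.Random(42)
periods = [next_period(rng) for _ in range(4)]
print(periods)  # four values in [360000, 440000]
```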
Even with these drawbacks, sampling-based PGO is a better fit in the development process than the instrumented approach.
3 Toolchain
Our implementation relies on the following components:
1. perf: The Hardware Performance Counters
proﬁler of the Linux kernel.
2. Gooda: An analyzer for perf data files. It is the result of a collaboration between the Lawrence Berkeley National Laboratory (LBNL) and Google.
3. AutoFDO (AFDO): A patch from Google bringing sampling-based PGO to GCC. See Section 3.1.
In this implementation, PGO proceeds in several steps (the first three being automated in a single step for the user by the profile generator):
1. The application is proﬁled with perf. A certain
set of events is collected.
2. Gooda generates intermediate spreadsheets by
analyzing the perf data ﬁles.
3. The spreadsheets are converted into an AFDO
proﬁle by our implementation.
4. GCC reads the profile with the AFDO patch and optimizes the code.
Figure 1: Profile process
Figure 1 shows an overview of these steps and the
intermediate ﬁles that are used in the process.
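The four steps above could be driven by a small script along these lines; the gooda and converter invocations are hypothetical placeholders, and the exact GCC flag depends on the AFDO patch:

```python
# Sketch of the four steps as shell commands built in Python (printed,
# not executed). Only `perf record` is a standard invocation; the gooda
# and converter command names, and the exact GCC profile flag, are
# hypothetical placeholders.
def pipeline(app, profile="app.afdo"):
    return [
        # 1. Sample the running application with perf.
        ["perf", "record", "-e", "branches", "--", app],
        # 2. Analyze the perf data file into spreadsheets (hypothetical CLI).
        ["gooda", "perf.data"],
        # 3. Convert the spreadsheets into an AFDO profile (hypothetical CLI).
        ["spreadsheets-to-afdo", "spreadsheets/", "-o", profile],
        # 4. Recompile, feeding the profile to GCC (flag name may differ).
        ["gcc", "-O2", "-fauto-profile=" + profile, "app.c", "-o", app],
    ]

for cmd in pipeline("./app"):
    print(" ".join(cmd))
```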
The profile used by this toolchain is an instruction profile: there is a counter value for every instruction of the application. This profile does not contain basic block information. The basic block profile is computed from
the instruction proﬁle inside GCC by AFDO. This pro-
ﬁle also includes the entry count and execution count
for each function.
Two modes have been implemented (Cycles Counting and LBR); see Section 3.2 and Section 3.3. In
both modes, an instruction proﬁle is generated with a
counter for each instruction.
Since Gooda is able to merge several spreadsheets, it is possible to collect profiles on several machines of a cluster, for instance, and then combine all of them into an accurate global profile. This can also be used
when the same executable is run with diﬀerent data
sets to merge the resulting proﬁles.
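Conceptually, merging amounts to summing per-instruction sample counts; a minimal sketch with an invented data layout:

```python
# Sketch: merging per-machine instruction profiles into one global
# profile, as Gooda does when combining spreadsheets. The dict layout
# {instruction_address: sample_count} is made up for illustration.
from collections import Counter

def merge_profiles(profiles):
    merged = Counter()
    for p in profiles:
        merged.update(p)   # sums counts for addresses seen on both machines
    return dict(merged)

machine_a = {0x400000: 120, 0x400004: 30}
machine_b = {0x400000: 80, 0x400010: 5}
merged = merge_profiles([machine_a, machine_b])
print(merged[0x400000])  # 200
```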
3.1 AutoFDO
AutoFDO (AFDO) is a patch for GCC developed by Google. AFDO is the successor of SampleFDO. It has been rewritten from scratch and has several major differences. SampleFDO reconstructed the CFG for each function using a Minimum Control Flow (MCF) algorithm and used a complex two-phase annotation pass to handle inlined functions. These two techniques were very expensive and complicated, and have therefore been abandoned in AFDO.
AFDO uses source proﬁle information. The proﬁle
maps each instruction to an inline stack, itself mapped
to runtime information. An inline stack is a stack of
functions that have been inlined at a speciﬁc call site.
The execution count and the number of instructions mapped to each source instruction are collected at runtime. Debug information is used to reconstruct the source profile from the runtime profile. While AFDO has been developed for perf, it is independent from it: the profile could be generated by another hardware profiler.
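The mapping can be pictured with the following sketch; the field names are illustrative and do not reflect the actual GCOV/AFDO file format:

```python
# Sketch of the mapping AFDO uses: instruction -> inline stack -> counts.
# Field names and layout are illustrative, not the on-disk format.
from dataclasses import dataclass

@dataclass(frozen=True)
class SourceLoc:
    filename: str
    function: str        # the BFD/assembler name in the real profile
    line: int
    discriminator: int = 0

@dataclass
class InlineStackEntry:
    stack: tuple         # innermost-first tuple of SourceLoc call sites
    count: int = 0       # execution count observed at runtime
    num_insns: int = 0   # number of instructions mapped to this location

loc = SourceLoc("vec.cc", "_ZN3Vec4normEv", 42)
site = SourceLoc("main.cc", "main", 10)
entry = InlineStackEntry(stack=(loc, site), count=17, num_insns=3)
profile = {(loc.filename, loc.line, loc.discriminator): entry}
print(profile[("vec.cc", 42, 0)].count)  # 17
```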
AFDO is activated using a command-line switch. A
special pass is made to read the proﬁle and load it
into custom data structures. The ﬁrst special use of
the proﬁle data is made during the early inline pass.
To make the proﬁle annotation easier, AFDO ensures
that each hot call site that was inlined in the proﬁled
binary is also inlined during this pass. For this, a
threshold-based top-down decision is used, during a
bottom-up traversal of the call graph.
Once early inlining decisions have been made, the
execution proﬁle is annotated using the AFDO pro-
ﬁle. The basic block counts are directly taken from
the proﬁle, whereas the edge counts are propagated
from the execution counts. The strength of AFDO is that it directly benefits from all the GCC optimizations that use the profile. When PGO is used, the static profile used in optimization passes is replaced with a profile generated from the AFDO profile. All backend optimization passes then use profile information as usual.
However, some special tuning still needs to be done.
For instance, during Interprocedural Register Alloca-
tion (IRA), the function is considered to not be called
if its entry basic block's count is zero. However, in sampling-based PGO, this is not necessarily the case, and special care is taken to ensure that such a function is not considered dead. The same problem arises in the hot call site heuristic, where the entry count of the callee function is checked for zero; in this case, the heuristic is disabled.
AFDO especially targets C/C++ applications and therefore specially tunes callgraph optimizations.
3.2 Cycles Counting
In this ﬁrst mode, the counter that is used is the num-
ber of Unhalted Core Cycles for each instruction. It
is a common measure of performance, as it computes
only the time when the processor is doing something,
not when it is waiting for another operation, such as I/O.
This mode is based on common Cycle Accounting
Analysis for Intel processors.
In this mode, the instruction proﬁle is naturally gen-
erated as events are directly mapped to an instruction.
3.3 LBR Mode
To achieve better accuracy, the second mode uses the Last Branch Record (LBR) feature of the PMU.
The LBR is a branch trace buffer. It captures the source and target addresses of each retired taken branch. It can track 16 pairs of addresses, and it provides a call tree context for every event for which LBR is enabled.
In this implementation, we do not directly manip-
ulate the LBR data. We take advantage of the fact
that Gooda already merges together the LBR samples
to compute the number of basic block executions.
The counter used in LBR mode is the number of Branch Instructions Retired (an instruction is retired when it has been executed and its result is used). It is an interesting event because it makes it easy to compute the number of executions of each basic block, as it references each branch.
By using the 16 addresses of the LBR history, the basic
block paths can be computed with high accuracy.
The instruction proﬁle is generated from the basic
block proﬁle, every instruction of a basic block having
the same counter value.
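A minimal sketch of deriving basic-block counts from LBR pairs (addresses invented; fall-through execution between consecutive entries is ignored):

```python
# Sketch: deriving basic-block execution counts from LBR
# (source, target) pairs of retired taken branches. A taken branch
# into a block starts one execution of that block.
from collections import Counter

def bb_counts(lbr_pairs, block_starts):
    """lbr_pairs: list of (branch_source, branch_target) addresses.
    block_starts: set of addresses that begin a basic block."""
    counts = Counter()
    for _, target in lbr_pairs:
        if target in block_starts:
            counts[target] += 1
    return counts

starts = {0x400000, 0x400020, 0x400040}
pairs = [(0x40001c, 0x400040), (0x40004c, 0x400020), (0x40002c, 0x400040)]
counts = bb_counts(pairs, starts)
print(counts[0x400040], counts[0x400020])  # 2 1
```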
3.4 Gathering the proﬁle
The proﬁle generated by perf and preprocessed by
Gooda is a binary instruction proﬁle. It means that
each instruction is identiﬁed by its address. This ad-
dress is not useful inside GCC, because during the
optimization process, instructions are not assigned an
address. To identify an instruction inside GCC, it is
necessary to have its source position.
Each binary instruction address must be mapped to
its source location. For that, the debug symbols are extracted from the ELF executable and used to reconstruct the source profile.
Each instruction is identiﬁed by four diﬀerent val-
ues: its ﬁlename, the containing function name, its
line number and its discriminator.
The DWARF discriminator makes it possible to distinguish different statements on the same line, which is very common in modern programming languages. Discriminators are gathered from the executable using addr2line for each instruction.
Another important point to distinguish instructions
is the handling of inlined functions. When a func-
tion gets inlined, there is a copy of the instructions
of the called function at several places in the binary.
All those diﬀerent copies have the same source loca-
tion debug symbols, so they are not enough to dis-
tinguish them. Fortunately, DWARF symbols include
the inline stack for each inlined function. By storing
the inline stack of each instruction, the proﬁle is very
accurate and provides a correct mapping between bi-
nary instructions and source instructions. These inline
stacks are then processed directly by AFDO. This pro-
cessing proved very important on large C++ projects.
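A sketch of recovering an inline stack from `addr2line -f -i -e <binary> <addr>` output, which prints function/file:line pairs with the innermost frame first; it is parsed here from canned text so no binary is needed, and real output may carry extra decorations such as discriminator annotations:

```python
# Sketch: parsing the inline stack for one instruction address out of
# addr2line -f -i output (alternating function name and file:line lines,
# innermost inline frame first).
def parse_inline_stack(addr2line_output):
    lines = addr2line_output.strip().splitlines()
    stack = []
    for func, loc in zip(lines[0::2], lines[1::2]):
        filename, _, lineno = loc.rpartition(":")
        stack.append((func, filename, int(lineno)))
    return stack

sample = """_ZN3Vec4normEv
vec.cc:42
main
main.cc:10
"""
print(parse_inline_stack(sample))
# [('_ZN3Vec4normEv', 'vec.cc', 42), ('main', 'main.cc', 10)]
```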
The last point of importance concerning the cre-
ation of the proﬁle is the function name. The name
of a function is generally not enough to identify it
uniquely. When using GCC, a function has two more names: the mangled (or assembler) name, which uniquely identifies a function (taking all parameters into account), and the Binary File Descriptor (BFD) name, used by GCC in the debug symbols to identify functions in inline stacks. To identify each source function,
its assembler name is taken from the table of symbols
of the ELF ﬁle.
The present implementation has some limitations.
First of all, the debug symbols are absolutely essential
in order to reconstruct the correct instruction proﬁle.
Indeed, without debug symbols, it would not be pos-
sible to match assembly instructions to their original
source locations. Our implementation works only for
programs that have been compiled in debug mode, but
optimizations can be enabled. Debug executables are
not slower than stripped executables, but they can
be much bigger. This can be overcome by keeping a stripped copy of the optimized executable and using it in production after PGO. However, the stripped production binary could not then be used for profiling.
Moreover, if the compiler does not generate pre-
cise debug symbols, the proﬁle will not be accurate.
This is especially a problem as some optimizations do not correctly preserve the debug symbols. Another problem comes in the form of over- or under-sampling. For instance, Loop Unrolling may cause the same statements to be duplicated in different basic blocks. Normalization may then lead to a profile that is too low for the generated basic blocks.
The implementation also depends heavily on addr2line and objdump to gather information about the binary file. To have all the features, a very recent version of binutils is necessary, at least 2.23.1. Moreover, if there is a bug in either of these tools, the generated profile could be inaccurate.
Another shortcoming comes from the fact that the
DWARF debugging data format does not support dis-
criminators in inline stacks. It means that the proﬁle
is not completely precise, which can lead the compiler to wrong decisions, although this has not been found to be an issue in practice.
4 Experimental Results
The implementation has been evaluated in terms of
speedups compared to the optimized version and to
the instrumentation-based PGO version. All binaries
were produced using a trunk version of the Google
GCC 4.7 branch. The target was the x86_64 architecture. The processor used for the tests was an Intel® Xeon™ E5-2650 at 2 GHz.
Four versions are compared:
•base: The optimized executable, compiled with -O2 -march=native. The same flags are also used for the other executables, with additional PGO flags.
•instr: The executable trained with instrumentation-based PGO.
•ucc: The program trained with our im-
plementation in Cycle Accounting mode.
UNHALTED CORE CYCLES is sampled with a period
of 2’000’000 events.
•lbr: The program trained with our implemen-
tation in LBR mode. BRANCH INST RETIRED is
sampled with a period of 400’000 events.
All the results were collected on the SPEC CPU
2006 V1.2 benchmarks. Each version of the binary is run three times and SPEC chooses the best result out of the three runs. The variation between the runs is very low.
For run time reasons, a subset of the benchmarks
has been selected. This subset has been chosen to
be representative of different programming languages, of both floating-point and integer programs, and of different fields of science.
For these tests, a modiﬁed Gooda version was used
to work on small programs. Gooda is especially made
for big programs and so the proﬁle is not generated
for functions below some thresholds. It has been necessary to change these thresholds in order to include as many functions as possible in the profile.
Figure 2 shows the performance of the diﬀerent
PGO versions. Each version is compared to the per-
formance of the base version.
The results are very interesting. Indeed, LBR
achieves about 75% of the instrumentation-based
PGO gains (arithmetic average of percentages). Cycle
Accounting is less eﬀective, but still achieves 53% of
the gains. This was expected as LBR should be more
accurate than Unhalted Core Cycles.
In some cases, sampling-based PGO even outper-
forms the traditional approach. For instance, on the astar benchmark, LBR achieves 116% to 129% of the
instrumentation gains. This is even more true for
calculix and xalancbmk where instrumentation-based
PGO performs very poorly and sampling achieves
good results. This diﬀerence is not so surprising, as
several optimizations are driven by threshold-based heuristics, so small differences in the profile can drastically change decisions and lead to better (or worse) code.
On the contrary, there are also benchmarks where
our implementation performs poorly compared to tra-
ditional PGO. For instance, bwaves proved a very bad
case for our implementation. It comes from the fact
that AutoFDO is not optimized for Fortran code, since it has been developed with a C/C++-centric approach. Indeed, special care has been taken to tune inlining and call optimization passes, while tuning for loop-intensive code has not been performed.
Figure 2: Speedups for SPEC CPU 2006 benchmarks. The application is trained with the training data set. Our implementation achieves 75% of instrumented PGO.
The ﬁrst results were obtained by training the exe-
cutable on the training data set. To see the impact of
the input data set, the executables were then trained
again on the reference data set. It means that the
same input data set is used for training and for bench-
marking the executable. Figure 3 shows the speedups
obtained when training on the reference data set.
This time, LBR is able to achieve 84% of the instru-
mentation gains. An interesting point about these re-
sults is that where instr and ucc improve their scores
by about 22%, lbr improves by 37%. LBR-sampling
seems to be even more interesting when the input data
is closer to the real data set. Most of the benchmarks
improved only a bit with the reference data set, but
xalancbmk improved by an order of magnitude. It
seems that in its case, the training data set is not
close enough to the reference data set for PGO. cal-
culix seems to be in the same case, although the dif-
ference is not so spectacular.
As AFDO has been especially tuned for C++, Fig-
ure 3b presents the results for C++ benchmarks only.
On C++ benchmarks, both sampling versions are
performing very well. Cycle Accounting achieves 67%
of the instrumentation gains and LBR reaches 93%.
These results are very promising. For each benchmark, lbr is very close to instr (and sometimes better).
This section presented some results that can still
be improved, especially for some non-C++ bench-
marks. Once AFDO has improved support for other
languages, like Fortran, it may be interesting to run
these tests again to see how much of the instrumenta-
tion gains sampling-based PGO can reach. It may also
be interesting to investigate the benchmarks where
lbr proved better than instr and see if the result
can be obtained on other benchmarks as well.
4.1 Sampling Overhead
The overhead of our implementation has also been evaluated and compared to the overhead of instrumentation.
Four versions are compared:
•base: The optimized executable, compiled with -O2 -march=native.
•instr: The GCOV instrumented executable.
•ucc: The program run under the monitoring of the profiler, sampling UNHALTED CORE CYCLES with a period of 2'000'000 events.
•lbr: The program run under the monitoring of the profiler, sampling BRANCH INST RETIRED with a period of 400'000 events.
The sampling periods used for this test were chosen by empirical evaluation, selecting the best one based on the speedup results and the overhead.
Figure 4 shows the overhead of the diﬀerent ver-
sions, compared to the base version.
The overhead of instrumentation is high, but not as
high as expected: only 16% on average. The highest overhead of instrumentation was 53%, on the povray benchmark.
As expected, the overhead of sampling is lower than
the overhead of instrumentation. LBR has an average overhead of 1.06%, which is 15 times lower than instrumentation. In several benchmarks the overhead is less than 1%.
(a) Non-C++ programs (b) C++ programs
Figure 3: Speedups for SPEC CPU 2006 benchmarks. The application is trained with the reference data set. Our implementation achieves 84% of instrumented PGO on all benchmarks and 93% on C++ programs only.
Figure 4: Overheads for SPEC CPU 2006 benchmarks. Our implementation has 15 times less overhead than instrumentation.
Unfortunately, the overhead of Cycle Counting is much higher than it should be: it is as high as 10% on average. The problem is that, in this mode,
Gooda proﬁles a large number of events, not only Un-
halted Core Cycles. This adds a large overhead in
terms of performance. At this time, it is not possible
to conﬁgure Gooda to only use Unhalted Core Cycles.
On the other hand, as the LBR event proved more accurate and has a small overhead, the value of this mode is limited.
Another point of interest is that the variability of
the overhead is much higher for instrumentation than
for sampling. Indeed, the overhead of instrumentation
varies from 0% to 53%, whereas the overhead of LBR-
sampling varies only from 0.3% to 2%. The instru-
mentation overhead highly depends on the code that
is being benchmarked. On the other hand, sampling
overhead depends mostly on the period of sampling
and the type of the sampled events.
Some instrumented executables even proved faster than their optimized counterparts. This may happen because, after the instrumentation instructions are inserted, the program is optimized further, and these new instructions may change the decisions
that are made during the optimization process. Such
changes in decision may lead to faster code. This sit-
uation cannot happen with sampling.
4.2 Tools Overhead
In the previous section, only the overhead of sampling
(with perf) has been measured. The time necessary to generate the profile from the perf data also needs to be taken into account.
Two different tools add overhead to the overall process. The first one, and also the slowest, is Gooda. Until now, Gooda has not been tuned for performance and it may be quite slow when handling very large profiles. The conversion from Gooda spreadsheets to an AFDO profile is less critical, since Gooda already filters out several functions. Moreover, the converter has already been tuned for performance so as to add the smallest possible overhead to the process.
Both overheads have been tested on several profiles gathered using perf in cycle accounting mode (UNHALTED CORE CYCLES with a period of 2'000'000 events). The profiles have been gathered on GCC compiling two different programs: a toy compiler and the converter itself. The perf profiles vary from 194KiB (eddic-list) to 167MiB (gcc-google-eddic). Each test has been run five times and the best result has been taken. The variations between different runs were very low.
Figure 5a presents the overhead of the converter.
As shown in this ﬁgure, the overhead of the converter
is not negligible. It takes a maximum of six seconds
for the test cases. This has to be put in perspective with the running time of the profiling. For instance, the test case gcc-eddic runs for 40 minutes, giving an overhead of 0.25%, which is acceptable.
(a) Converter Overhead (b) Gooda Overhead
Figure 5: Overhead of the profile generation.
An important point to consider is that the converter's overhead does not scale directly
with the size of the proﬁle but with the number of
functions reported by Gooda, which should grow up
to a maximum related to the size of the proﬁled ex-
ecutable. In the converter, about 65% of the time is
spent in calling addr2line. This could be improved by integrating address-to-line conversion directly into the converter.
Figure 5b shows the overhead of Gooda, converting
the perf proﬁle to a set of spreadsheets. It is very
clear that the overhead of Gooda is much higher than
the overhead of the converter and is considerable for
at least two test cases (both gcc test cases). On the
slowest test case (gcc-eddic), the overhead is as high as five percent, which makes the toolchain much less interesting. It also takes several seconds for the other samples, even though they are much faster to run under profiling. The relative overhead generally improves with the running time of the program. In its current state, the toolchain is better suited to long-running applications.
The long running time of Gooda is something that should really be improved in the future.
5 Related Work
In 2008, Roy Levin, Ilan Newman and Gadi Haber proposed a solution to generate edge profiles from instruction profiles of the instructions retired hardware event for FDPR-Pro, the IBM post-link-time optimizer. This solution works at the binary level. The
proﬁle is applied to the corresponding basic blocks af-
ter link-time. The construction of the edge proﬁle
from the sample proﬁle is known as a Minimum Cost
Circulation problem. They showed that this can be
solved in acceptable time for the SPEC benchmarks,
but this remains a heavy algorithm.
Soon after Levin et al., Vinodha Ramasamy, Robert
Hundt, Dehao Chen and Wenguang Chen presented another solution, using the instructions retired hardware event to construct an edge profile. This
solution was implemented and tested in the Open64
compiler. Unlike the previous work, the proﬁle is re-
constructed from the binary using source position in-
formation. This has the advantage that the binary can
be built using any compiler and then used by Open64
to perform PGO. They were able to reach an average of 80% of the gains that can be obtained with instrumentation.
In 2010, Dehao Chen et al. continued the work
started in Open64 and adapted it for GCC. In this
work, several optimizations of GCC were specially
adapted to the use of sampling proﬁles. The ba-
sic block and edges frequencies are derived using a
Minimum Control Flow algorithm. In this solution,
the Last Branch Record (LBR) precise sampling fea-
ture of the processor was used to improve the accu-
racy of the proﬁle. Moreover, they also used a spe-
cial version of the Lightweight Interprocedural Opti-
mizer (LIPO) of GCC. The value proﬁle is also derived
from the sample proﬁle using PEBS mode. With all
these optimizations put together, they were able to
achieve an average of 92% of the performance gains of instrumentation-based Feedback-Directed Optimization.
More recently, Dehao Chen (Google) released AutoFDO (AFDO), on which our solution is based. It is a patch for GCC to handle sampling-based profiles. The
profile is represented by a GCOV file containing function profiles. Several optimizations of GCC have been reviewed to handle this new kind of profile more accurately. The profile is generated from the debug information contained in the executable, and the samples are collected using the perf tool. AutoFDO is especially made to support optimized binaries. For the time being, AFDO does not handle value profiles. Only the GCC patch has been released so far; no tool to generate the profile has been released.
On the Performance Counters side, Vincent M. Weaver
showed that, when the setup is correctly tuned,
performance counter values vary very little between
runs (0.002% on the SPEC benchmarks). Nonetheless,
very subtle changes in the setup can result in large
variations in the results.
Other sampling approaches that do not use performance
counters have also been proposed. For instance, the
Morph system uses statistical sampling of the program
counter to collect profiles. In another solution,
kernel instructions were used to sample the contents
of the hardware branch-prediction unit. These two
solutions require that additional information be
encoded into the binary to correlate samples with the
compiler's Intermediate Representation.
Performance Counters are also starting to be used in
areas other than Profile-Guided Optimization. For
instance, Schneider et al. sample performance counters
to optimize data locality in a garbage-collected
virtual machine. In this solution, the collected data
were used to drive online optimizations in the VM's
Just-In-Time (JIT) compiler.
6 Conclusion and Future Work
We designed and implemented a toolchain that uses
Hardware Event sampling to drive Profile-Guided
Optimization inside GCC. Our implementation proved
competitive with instrumentation-based PGO both in
performance, achieving 93% of the gains of traditional
PGO, and in speed, incurring 15 times less overhead.
The experiments show that this technique is already
usable in production. However, its high performance is
currently limited to C++ programs. Some work remains
to extend the current toolchain to more programming
languages; the most important changes will need to be
made in AFDO.
Instrumentation-based PGO still has one advantage over
our implementation: it can generate value profiles.
This feature is not yet supported by our toolchain.
However, it has already been implemented for
sampling-based PGO in , and it was being developed in
AFDO during our project; once available, it should be
integrated into the toolchain itself.
The presented toolchain makes it easy to handle new
events. These events may lead to the implementation of
novel optimizations, although sampling more events
also incurs more overhead during profiling.
Experiments have been made to integrate Load Latency
events into GCC. The difficulty is that this new
information is hard to use in existing or new
optimization techniques. We implemented a Loop Fusion
pass for GCC that takes Load Latency into account in
its decision heuristics. The main difficulty with Load
Latency events is that they are not accurate enough at
the basic-block level. To make better use of these
events, the compiler would need a profile at the
instruction level.
Some work will also have to be done to improve the
speed of Gooda for large profiles, which is currently
too slow.
Acknowledgments

We want to thank the reviewers for their comments
and corrections regarding this paper. We would like
to thank Stephane Eranian for his help in using the
perf profiler and for providing kernel patches for
perf events and libpfm.
The implementation of the profile generator is
available on GitHub (https://github.com/wichtounet/
The usage of the toolchain is described on the
repository home page.
 Thomas Ball and James R. Larus. Optimally profiling
and tracing programs. ACM Transactions on Programming
Languages and Systems, 16:59–
 Thomas Ball and James R. Larus. Efficient path
profiling. In Proceedings of the 29th Annual
International Symposium on Microarchitecture, pages
46–57, 1996.
 Paolo Calafiura, Stephane Eranian, David Levinthal,
Sami Kama, and Roberto Agostino Vitillo. Gooda: The
generic optimization data analyzer. Journal of
Physics: Conference Series,
 Dehao Chen, Neil Vachharajani, Robert Hundt,
Xinliang Li, Stephane Eranian, Wenguang Chen,
and Weimin Zheng. Taming hardware event sam-
ples for precise and versatile feedback directed op-
timizations. IEEE Transactions on Computers,
 Dehao Chen, Neil Vachharajani, Robert Hundt,
Shih-wei Liao, Vinodha Ramasamy, Paul Yuan,
Wenguang Chen, and Weimin Zheng. Taming
hardware event samples for fdo compilation. In
Proceedings of the 8th annual IEEE/ACM inter-
national symposium on Code generation and opti-
mization, CGO ’10, pages 42–52, New York, NY,
USA, 2010. ACM.
 Thomas M. Conte, Burzin Patel, Kishore N. Menezes,
and J. Stan Cox. Hardware-based profiling: An
effective technique for profile-driven optimization.
 Arnaldo Carvalho de Melo. The new Linux 'perf'
tools. In Slides from Linux Kongress, 2010.
 Roy Levin, Ilan Newman, and Gadi Haber. Com-
plementing missing and inaccurate proﬁling us-
ing a minimum cost circulation algorithm. In
Proceedings of the 3rd international conference
on High performance embedded architectures and
compilers, HiPEAC’08, pages 291–304, Berlin,
Heidelberg, 2008. Springer-Verlag.
 David Levinthal. Cycle Accounting Analysis on Intel
Core 2 Processors. Technical report, Intel Corp.,
2008.
 David Levinthal. Performance Analysis Guide for
Intel Core i7 and Intel Xeon 5500 processors.
Technical report, Intel Corp., 2009.
 Omitted. Omitted. Master’s thesis, Omitted,
 Vinodha Ramasamy, Paul Yuan, Dehao Chen, and Robert
Hundt. Feedback-directed optimizations in GCC with
estimated edge profiles from hardware event sampling.
In Proceedings of GCC Summit 2008, pages 87–102,
2008.
 Florian T. Schneider, Mathias Payer, and
Thomas R. Gross. Online optimizations driven
by hardware performance monitoring. In Pro-
ceedings of the 2007 ACM SIGPLAN conference
on Programming language design and implemen-
tation, PLDI ’07, pages 373–382, New York, NY,
USA, 2007. ACM.
 Vincent M. Weaver and Sally A. McKee. Can
hardware performance counters be trusted? In
IISWC, pages 141–150, 2008.
 Xiaolan Zhang, Zheng Wang, Nicholas Gloy,
J. Bradley Chen, and Michael D. Smith. System
support for automatic profiling and optimization.
In Proceedings of the sixteenth ACM symposium
on Operating systems principles, SOSP ’97, pages
15–26, New York, NY, USA, 1997. ACM.