Thesis

Cache-Friendly Profile Guided Optimization

References
Conference Paper
False sharing, which occurs when multiple threads access different data elements on the same cache line and at least one of them updates the data, is a well-known source of performance degradation on cache-coherent parallel systems. The application developer is often unaware of this problem during program creation, and it can be hard to detect instances of its occurrence in a large code base. In this paper, we present a compile-time cost model for estimating the performance impact of false sharing on parallel loops. Using this model, we are able to predict the amount of false sharing that could occur when the loop is executed, and can indicate the percentage of program execution time that is spent maintaining the coherence of falsely shared data. We evaluated our model by comparing its predictions on several computational kernels, using 2 to 48 threads, against measurements from actual execution. The results show that our model can accurately quantify the impact of false sharing on loop performance at compile time.
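To make the effect concrete, the C sketch below (illustrative only, not code from the paper; the 64-byte line size, iteration count, and pthread setup are assumptions) has two threads update adjacent fields that share one cache line, then repeats the run with each counter padded onto its own line, which is the access pattern the cost model is meant to flag.

/* Minimal false-sharing illustration: two threads increment adjacent
 * counters on the same cache line, so every store invalidates the other
 * core's copy of the line; the padded variant gives each counter its
 * own line. Build with: cc -O2 -pthread. */
#include <pthread.h>
#include <stdio.h>

#define ITERS 100000000L
#define CACHE_LINE 64   /* assumed line size; 64 bytes on most x86 parts */

struct { long a, b; } shared_pair;                         /* a and b share a line */
_Alignas(CACHE_LINE) struct {
    long v;
    char pad[CACHE_LINE - sizeof(long)];                   /* one counter per line */
} padded[2];

static void *bump(void *arg)
{
    volatile long *p = (volatile long *)arg;               /* keep stores in the loop */
    for (long i = 0; i < ITERS; i++)
        (*p)++;
    return NULL;
}

int main(void)
{
    pthread_t t[2];

    /* falsely shared counters */
    pthread_create(&t[0], NULL, bump, &shared_pair.a);
    pthread_create(&t[1], NULL, bump, &shared_pair.b);
    pthread_join(t[0], NULL); pthread_join(t[1], NULL);

    /* padded counters, no coherence traffic between the threads */
    pthread_create(&t[0], NULL, bump, &padded[0].v);
    pthread_create(&t[1], NULL, bump, &padded[1].v);
    pthread_join(t[0], NULL); pthread_join(t[1], NULL);

    printf("%ld %ld %ld %ld\n", shared_pair.a, shared_pair.b,
           padded[0].v, padded[1].v);
    return 0;
}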
Article
Graphite is the loop transformation framework that was introduced in GCC 4.4. This paper gives a detailed description of the design and future directions of this infrastructure. Graphite uses the polyhedral model as the internal representation (GPOLY). The plan is to create a polyhedral compilation package (PCP) that will provide loop optimization and analysis capabilities to GCC. This package will be separated from GIMPLE via an interface language that is restricted to express only what GPOLY can represent. The interface language is a set of data structures that encodes the control flow and memory accesses of a code region. A syntax for the language is also defined to facilitate debugging and testing.
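As a concrete illustration (not drawn from the paper), a loop nest like the following C fragment is the kind of region a polyhedral representation such as GPOLY can capture: the bounds and subscripts are affine in the loop indices and the parameters N and M, so the iteration space forms a polyhedron and each access is an affine map over it. The function name and parameters here are assumptions for the example.

/* A loop nest with affine bounds and affine array subscripts:
 * iteration domain { (i, j) : 0 <= i < N, 0 <= j < M } and
 * accesses A[i][j], B[i][j] as affine functions of (i, j). */
void scale_add(int N, int M, double A[N][M], double B[N][M], double c)
{
    for (int i = 0; i < N; i++)              /* 0 <= i < N */
        for (int j = 0; j < M; j++)          /* 0 <= j < M */
            A[i][j] = c * A[i][j] + B[i][j]; /* affine accesses only */
}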
Conference Paper
When creating architectural tools, it is essential to know whether the generated results make sense. Comparing a tool's outputs against hardware performance counters on an actual machine is a common means of executing a quick sanity check. If the results do not match, this can indicate problems with the tool, unknown interactions with the benchmarks being investigated, or even unexpected behavior of the real hardware. To make future analyses of this type easier, we explore the behavior of the SPEC benchmarks with both dynamic binary instrumentation (DBI) tools and hardware counters. We collect retired-instruction performance counter data from the full SPEC CPU 2000 and 2006 benchmark suites on nine different implementations of the x86 architecture. When run with no special preparation, hardware counters show a coefficient of variation of up to 1.07%. After analyzing results in depth, we find that minor changes to the experimental setup reduce observed errors to less than 0.002% for all benchmarks. The fact that subtle changes in how experiments are conducted can substantially affect observed results is unexpected, and it is important that researchers using these counters be aware of the issues involved.
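For reference, a minimal Linux perf_event sketch along these lines might look as follows. It is illustrative only, not the authors' measurement harness, and error handling is reduced to the essentials: it opens a retired-instruction counter with perf_event_open, measures a small user-space region, and reads back the count.

/* Count retired instructions for a region using the Linux perf_event API. */
#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>
#include <string.h>
#include <stdint.h>
#include <stdio.h>

static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
                            int cpu, int group_fd, unsigned long flags)
{
    return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
}

int main(void)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_INSTRUCTIONS;   /* retired instructions */
    attr.disabled = 1;
    attr.exclude_kernel = 1;                    /* count user space only */
    attr.exclude_hv = 1;

    int fd = perf_event_open(&attr, 0, -1, -1, 0);
    if (fd < 0) { perror("perf_event_open"); return 1; }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

    volatile unsigned long sum = 0;             /* region under measurement */
    for (unsigned long i = 0; i < 1000000; i++) sum += i;

    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    uint64_t count;
    if (read(fd, &count, sizeof(count)) != (ssize_t)sizeof(count)) {
        perror("read"); close(fd); return 1;
    }
    printf("retired instructions: %llu\n", (unsigned long long)count);
    close(fd);
    return 0;
}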
Conference Paper
Edge profiling is a very common means of providing feedback on program behavior that can be used statically by an optimizer to produce highly optimized binaries. However, collecting a full edge profile carries significant runtime overhead. This overhead creates additional problems for real-time applications, as it may prevent the system from meeting runtime deadlines and thus alter its behavior. In this paper we show how a low-overhead sampling technique can be used to collect an inaccurate profile, which is later used to approximate the full edge profile using a novel technique based on the Minimum Cost Circulation Problem. The outcome is a machine-independent profile-gathering scheme that creates a slowdown of only 2%-3% during the training run, and produces an optimized binary whose performance is only 0.6% below that of a fully optimized one.
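To make the source of the overhead concrete, the toy C fragment below (an illustration, not the paper's instrumentation) shows what full edge profiling amounts to: a counter increment on every control-flow edge of even a trivial function. These increments are exactly the run-time cost that the sampling-plus-circulation approach is designed to avoid.

/* Toy sketch of full edge instrumentation: one counter per CFG edge. */
#include <stdio.h>

static unsigned long edge_count[4];   /* one slot per control-flow edge */

int classify(int x)
{
    int r;
    if (x > 0) {
        edge_count[0]++;              /* edge: entry -> then block */
        r = 1;
        edge_count[1]++;              /* edge: then block -> exit  */
    } else {
        edge_count[2]++;              /* edge: entry -> else block */
        r = -1;
        edge_count[3]++;              /* edge: else block -> exit  */
    }
    return r;
}

int main(void)
{
    for (int i = -5; i <= 5; i++) classify(i);
    for (int e = 0; e < 4; e++)
        printf("edge %d: %lu\n", e, edge_count[e]);
    return 0;
}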
Conference Paper
In the past decade, processor speed has become significantly faster than memory speed. Small, fast cache memories are designed to overcome this discrepancy, but they are only effective when programs exhibit data locality. In this paper, we present compiler optimizations to improve data locality based on a simple yet accurate cost model. The model computes both temporal and spatial reuse of cache lines to find desirable loop organizations. The cost model drives the application of compound transformations consisting of loop permutation, loop fusion, loop distribution, and loop reversal. We demonstrate that these program transformations are useful for optimizing many programs. To validate our optimization strategy, we implemented our algorithms and ran experiments on a large collection of scientific programs and kernels. Experiments with kernels illustrate that our model and algorithm can select and achieve the best performance. For over thirty complete applications, we executed the original and transformed versions and simulated cache hit rates. We collected statistics about the inherent characteristics of these programs and our ability to improve their data locality. To our knowledge, these studies are the first of such breadth and depth. We found performance improvements were difficult to achieve because benchmark programs typically have high hit rates even for small data caches; however, our optimizations significantly improved several programs.
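A simple example of the kind of transformation such a cost model drives is loop permutation. The C sketch below (illustrative only, with an assumed array size N) contrasts a nest whose inner loop walks down a column, producing stride-N accesses with little spatial reuse in row-major storage, with the permuted nest whose inner loop walks along a row and reuses each cache line fully.

#define N 1024

/* Column-outer order is poor for C's row-major layout:
 * the inner loop strides by N doubles between accesses. */
void sum_rows_poor(double a[N][N], double out[N])
{
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            out[i] += a[i][j];        /* stride-N: new cache line per access */
}

/* Permuted (row-outer) order gives unit-stride accesses,
 * so consecutive iterations hit the same cache line. */
void sum_rows_good(double a[N][N], double out[N])
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            out[i] += a[i][j];        /* unit stride: full spatial reuse */
}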
Conference Paper
Feedback-directed optimization (FDO) is effective in improving application runtime performance, but has not been widely adopted due to the tedious dual-compilation model, the difficulties in generating representative training data sets, and the high runtime overhead of profile collection. The use of hardware-event sampling to generate estimated edge profiles overcomes these drawbacks. Yet, hardware event samples are typically not precise at the instruction or basic-block granularity. These inaccuracies lead to missed performance when compared to instrumentation-based FDO. In this paper, we use multiple hardware event profiles and supervised learning techniques to generate heuristics for improved precision of basic-block-level sample profiles, and to further improve the smoothing algorithms used to construct edge profiles. We demonstrate that sampling-based FDO can achieve an average of 78% of the performance gains obtained using instrumentation-based exact edge profiles for SPEC2000 benchmarks, matching or beating instrumentation-based FDO in many cases. The overhead of collection is only 0.74% on average, while compiler-based instrumentation incurs 6.8%-53.5% overhead (and 10x overhead on an industrial web search application), and dynamic instrumentation incurs 28.6%-1639.2% overhead.
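The flow-conservation idea behind turning noisy per-basic-block sample counts into an edge profile can be sketched on a toy diamond CFG, as below. This C fragment is purely illustrative with made-up counts and is not the paper's smoothing or learning machinery: blocks with a single incoming and outgoing edge pin down the incident edge counts, and an inconsistent sampled count elsewhere can then be corrected from them.

/* Diamond CFG: A -> {B, C}, B -> D, C -> D.
 * B and C each have one in-edge and one out-edge, so their block counts
 * determine all four edge counts; A's noisy count is corrected to B + C. */
#include <stdio.h>

int main(void)
{
    /* hypothetical sample-derived block counts; A is slightly inconsistent */
    double count_A = 980.0, count_B = 600.0, count_C = 410.0;

    double edge_AB = count_B;            /* only edge entering B */
    double edge_AC = count_C;            /* only edge entering C */
    double edge_BD = count_B;            /* only edge leaving  B */
    double edge_CD = count_C;            /* only edge leaving  C */
    double fixed_A = edge_AB + edge_AC;  /* flow conservation at A */

    printf("A->B %.0f  A->C %.0f  B->D %.0f  C->D %.0f  A: %.0f -> %.0f\n",
           edge_AB, edge_AC, edge_BD, edge_CD, count_A, fixed_A);
    return 0;
}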