Conference Paper

On the Efficiency of Transactional Code Generation: A GCC Case Study


Abstract

Memory transactions are becoming more popular as chip manufacturers build native support for their execution. Although current Intel and IBM microprocessors support transactions in their instruction set architectures, there is still room for improvement on the compiler and runtime fronts. The GNU Compiler Collection (GCC) has language support for transactions, but performance remains a hindrance to its wider use. In this paper we perform an up-to-date study of GCC's transactional code generation and highlight where the main performance losses come from. Our study indicates that one of the main sources of inefficiency is the read and write barriers inserted by the compiler. Most of this instrumentation is required because the compiler cannot determine, at compile time, whether a region of memory will be accessed concurrently. To overcome this limitation, we propose new language constructs that allow programmers to specify which memory locations should be free from instrumentation. Initial experimental results show a good speedup when barriers are elided using our proposed language support, compared to the original code generated by GCC.
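The cost the abstract attributes to read and write barriers can be illustrated with a small simulation. The sketch below is not GCC output and does not use a real TM runtime: `tm_read`, `tm_write`, and `count_barriers` are hypothetical stand-ins that only count how often a barrier would be invoked. It contrasts a conservatively instrumented loop with the same loop when the programmer has declared the buffer free from instrumentation.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for STM barriers. A real runtime would also
   log the access and validate it against concurrent transactions;
   here each barrier only bumps a counter. */
static unsigned long barrier_calls = 0;

static long tm_read(const long *addr) { barrier_calls++; return *addr; }
static void tm_write(long *addr, long v) { barrier_calls++; *addr = v; }

/* Conservative instrumentation: every load and store of buf goes
   through a barrier, because the compiler cannot prove that buf is
   private to the transaction. */
long fill_and_sum_instrumented(long *buf, size_t n) {
    long sum = 0;
    for (size_t i = 0; i < n; i++) tm_write(&buf[i], (long)i);
    for (size_t i = 0; i < n; i++) sum += tm_read(&buf[i]);
    return sum;
}

/* Same computation when buf is known to be free from instrumentation:
   plain loads and stores, no barrier calls. */
long fill_and_sum_elided(long *buf, size_t n) {
    long sum = 0;
    for (size_t i = 0; i < n; i++) buf[i] = (long)i;
    for (size_t i = 0; i < n; i++) sum += buf[i];
    return sum;
}

/* Runs one variant on the caller's buffer and reports how many
   barrier calls it made. */
unsigned long count_barriers(int elide, long *buf, size_t n) {
    barrier_calls = 0;
    if (elide) (void)fill_and_sum_elided(buf, n);
    else       (void)fill_and_sum_instrumented(buf, n);
    return barrier_calls;
}
```

For a buffer of `n` elements the instrumented variant performs `2n` barrier calls and the elided variant performs none, while both return the same sum. Eliding is only safe when no other thread can reach the buffer during the transaction, which is the property the programmer asserts through the proposed constructs.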


... In order to see that the over-instrumentation problem can happen even with very simple code, consider the example shown in Listing 1 from Honorio et al. [17]. In this example, foo is a function that starts a transaction (line 2) which first creates a linked-list (line 3) and then calls initList, passing the list as an argument (line 4). ...
... Moreover, Zardoshti et al.'s transactional runtime does not follow Intel's TM ABI [19], thus requiring that existing transactional libraries be adapted prior to their use. Honorio et al. [17] propose a pragma-based elision mechanism that requires programmers to insert pragmas at each usage scope of a transactional local value. Our approach using TMFree enforces the correct interplay between local and non-local transactional variables, thus allowing the same flexibility of elidebar [17], but with the additional type enforcement guarantees. ...
Article
With chip manufacturers such as Intel, IBM and ARM offering native support for transactional memory in their instruction set architectures, memory transactions are on the verge of being considered a genuine application tool rather than just an interesting research topic. Despite this recent increase in popularity on the hardware side of transactional memory (HTM), software support for transactional memory (STM) is still scarce, and the only compiler with transactional support currently available, the GNU Compiler Collection (GCC), does not generate code that achieves desirable performance. For hybrid TM solutions (HyTM), which are frameworks that leverage the best aspects of HTM and STM, the subpar performance of the software side, caused by inefficient compiler-generated code, might prevent HyTM from delivering optimal results. This article extends previous work focused exclusively on STM implementations by presenting a detailed analysis of transactional code generated by GCC in the context of HyTM implementations. In particular, it builds on previous research on transactional memory support in the Clang/LLVM compiler framework, which is decoupled from any TM runtime, and presents the following novel contributions: (a) it shows that STM's performance overhead, due to an excessive number of read and write barriers added by the compiler, also impacts the performance of HyTM systems; (b) it reveals the importance of the previously proposed annotation mechanism in reducing the performance gap between HTM and STM in phased runtime systems. Furthermore, it shows that, by correctly using the annotations on just a few lines of code, it is possible to reduce the total number of instrumented barriers by 95% and to achieve speed-ups of up to 7x when compared to the original code generated by GCC and the Clang compiler.
Article
Despite the absence of a formal process and a central command-and-control structure, developer organization in open-source software (OSS) projects is far from being a purely random process. Prior work indicates that, over time, highly successful OSS projects develop a hybrid organizational structure that comprises a hierarchical part and a non-hierarchical part. This suggests that hierarchical organization is not necessarily a global organizing principle and that a fundamentally different principle is at play below the lowest positions in the hierarchy. Given that the vast majority of developers are in the non-hierarchical part, we seek to understand the interplay between these two fundamentally differently organized groups, how this hybrid structure evolves, and the trajectory individual developers take through these structures over the course of their participation. We conducted a longitudinal study of the full histories of 20 popular OSS projects, modeling their organizational structures as networks of developers connected by communication ties and characterizing developers' positions in terms of hierarchical (sub)structures in these networks. We observed a number of notable trends and patterns in the subject projects: (1) hierarchy is a pervasive structural feature of developer networks of OSS projects; (2) OSS projects tend to form hybrid organizational structures, consisting of a hierarchical and a non-hierarchical part; and (3) the positional trajectory of a developer starts loosely connected in the non-hierarchical part and then becomes tightly integrated into the hierarchical part, which is associated with the acquisition of experience (tenure), in addition to coordination and coding activities.
Our study (a) provides a methodological basis for further investigations of hierarchy formation, (b) suggests a number of hypotheses on prevalent organizational patterns and trends in OSS projects to be addressed in further work, and (c) may ultimately guide the governance of organizational structures.
Conference Paper
With chip manufacturers such as Intel, IBM and ARM offering native support for transactional memory in their instruction set architectures, memory transactions are on the verge of being considered a genuine application tool rather than just an interesting research topic. Despite this recent increase in popularity on the hardware side of transactional memory (HTM), software support for transactional memory (STM) is still scarce and the only compiler with transactional support currently available, the GNU Compiler Collection (GCC), does not generate code that achieves desirable performance. This paper presents a detailed analysis of transactional code generated by GCC and by a proposed transactional memory support added to the Clang/LLVM compiler framework. Experimental results support the following contributions: (a) STM’s performance overhead is due to an excessive amount of read and write barriers added by the compiler; (b) a new annotation mechanism for the Clang/LLVM compiler framework that aims to overcome the barrier over-instrumentation problem by allowing programmers to specify which variables should be free from transactional instrumentation; (c) a profiling tool that ranks the most accessed memory locations at runtime, working as a guiding tool for programmers to annotate the code. Furthermore, it is revealed that, by correctly using the annotations on just a few lines of code, it is possible to reduce the total number of instrumented barriers by 95% and to achieve speed-ups of up to 7x when compared to the original code generated by GCC and the Clang compiler.
Conference Paper
The addition of transactional memory (TM) support to existing languages provides the opportunity to create new software from scratch using transactions, and also to simplify or extend legacy code by replacing existing synchronization with language-level transactions. In this paper, we describe our experiences transactionalizing the memcached application through the use of the GCC implementation of the Draft C++ TM Specification. We present experiences and recommendations that we hope will guide the effort to integrate TM into languages, and that may also contribute to the growing collective knowledge about how programmers can begin to exploit TM in existing production-quality software.
Conference Paper
Supporting atomic blocks (e.g., Transactional Memory (TM)) can have far-reaching effects on language design and implementation. While much is known about the language-level semantics of TM and the performance of algorithms for implementing TM, little is known about how platform characteristics affect the manner in which a compiler should instrument code to achieve efficient transactional behavior. We explore the interaction between compiler instrumentation and the performance of transactions. Through evaluation on ARM/Android, SPARC/Solaris, IA32/Linux and IA32/MacOS, we show that the compiler must consider the platform when determining which analyses, transformations, and optimizations to perform. Implementation issues include how TM library code is reached, how per-thread TM metadata is stored and accessed, and how a library switches between modes of operation. We also show that different platforms favor different TM algorithms, through the introduction of a new TM algorithm for the ARM processor. Our findings will affect compiler and TM library designers: to achieve peak performance for transactions, the compiler must perform platform-dependent analysis, transformation, and optimization, and the interface to the TM library must differ according to platform.
Conference Paper
This paper describes an end-to-end system implementation of the transactional memory (TM) programming model on top of the hardware transactional memory (HTM) of the Blue Gene/Q (BG/Q) machine. The TM programming model supports most C/C++ programming constructs on top of a best-effort HTM with the help of a complete software stack including the compiler, the kernel, and the TM runtime. An extensive evaluation of the STAMP benchmarks on BG/Q is the first of its kind in understanding characteristics of running coarse-grained TM workloads on HTMs. The study reveals several interesting insights on the overhead and the scalability of BG/Q HTM with respect to sequential execution, coarse-grain locking, and software TM.
Article
Programmers have traditionally used locks to synchronize concurrent access to shared data. Lock-based synchronization, however, has well-known pitfalls: using locks for fine-grain synchronization and composing code that already uses locks are both difficult and prone to deadlock. Transactional memory provides an alternate concurrency control mechanism that avoids these pitfalls and significantly eases concurrent programming. Transactional memory language constructs have recently been proposed as extensions to existing languages or included in new concurrent language specifications, opening the door for new compiler optimizations that target the overheads of transactional memory. This paper presents compiler and runtime optimizations for transactional memory language constructs. We present a high-performance software transactional memory system (STM) integrated into a managed runtime environment. Our system efficiently implements nested transactions that support both composition of transactions and partial roll back. Our JIT compiler is the first to optimize the overheads of STM, and we show novel techniques for enabling JIT optimizations on STM operations. We measure the performance of our optimizations on a 16-way SMP running multi-threaded transactional workloads. Our results show that these techniques enable transactional memory's performance to compete with that of well-tuned synchronization.
Conference Paper
Transactional Memory (TM) promises to simplify concurrent programming, which has been notoriously difficult but crucial in realizing the performance benefit of multi-core processors. Software Transactional Memory (STM), in particular, represents a body of important TM technologies since it provides a mechanism to run transactional programs when hardware TM support is not available, or when hardware TM resources are exhausted. Nonetheless, most previous studies on STMs were constrained to executing trivial, small-scale workloads. The assumption was that the same techniques applied to small-scale workloads could readily be applied to real-life, large-scale workloads. However, by executing several nontrivial workloads such as particle dynamics simulation and game physics engine on a state-of-the-art STM, we noticed that this assumption does not hold. Specifically, we identified four major performance bottlenecks that were unique to the case of executing large-scale workloads on an STM: false conflicts, over-instrumentation, privatization-safety cost, and poor amortization. We believe that these bottlenecks would be common for any STM targeting real-world applications. In this paper, we describe those identified bottlenecks in detail, and we propose novel solutions to alleviate the issues. We also thoroughly validate these approaches with experimental results on real machines.
Conference Paper
Transactional memory offers significant advantages for concurrency control compared to locks. This paper presents the design and implementation of transactional memory constructs in an unmanaged language. Unmanaged languages pose a unique set of challenges to transactional memory constructs, such as lack of type and memory safety, use of function pointers, aliasing of local variables, and others. This paper describes novel compiler and runtime mechanisms that address these challenges and optimize the performance of transactions in an unmanaged environment. We have implemented these mechanisms in a production-quality C compiler and a high-performance software transactional memory runtime. We measure the effectiveness of these optimizations and compare the performance of lock-based versus transaction-based programming on a set of concurrent data structures and the SPLASH-2 benchmark suite. On a 16 processor SMP system, the transaction-based version of the SPLASH-2 benchmarks scales much better than the coarse-grain locking version and performs comparably to the fine-grain locking version.
Conference Paper
Atomic blocks allow programmers to delimit sections of code as 'atomic', leaving the language's implementation to enforce atomicity. Existing work has shown how to implement atomic blocks over word-based transactional memory that provides scalable multi-processor performance without requiring changes to the basic structure of objects in the heap. However, these implementations perform poorly because they interpose on all accesses to shared memory in the atomic block, redirecting updates to a thread-private log which must be searched by reads in the block and later reconciled with the heap when leaving the block. This paper takes a four-pronged approach to improving performance: (1) we introduce a new 'direct access' implementation that avoids searching thread-private logs, (2) we develop compiler optimizations to reduce the amount of logging (e.g. when a thread accesses the same data repeatedly in an atomic block), (3) we use runtime filtering to detect duplicate log entries that are missed statically, and (4) we present a series of GC-time techniques to compact the logs generated by long-running atomic blocks. Our implementation supports short-running scalable concurrent benchmarks with less than 50% overhead over a non-thread-safe baseline. We support long atomic blocks containing millions of shared memory accesses with a 2.5-4.5x slowdown.
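Point (3) of the abstract above, runtime filtering of duplicate log entries, can be sketched as follows. This is an illustrative assumption about how such a filter might look, not the implementation described in the paper: `undo_log_t`, `log_init`, `log_address`, and the table sizes are hypothetical. A small direct-mapped table remembers recently logged addresses so that repeated accesses to the same location do not append new entries.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LOG_CAP      64   /* hypothetical log capacity     */
#define FILTER_SLOTS 32   /* hypothetical filter table size */

typedef struct {
    void  *entries[LOG_CAP];      /* addresses logged so far        */
    size_t count;                 /* number of valid entries        */
    void  *filter[FILTER_SLOTS];  /* last address seen in each slot */
} undo_log_t;

void log_init(undo_log_t *l) {
    *l = (undo_log_t){0};
}

/* Returns 1 if addr was appended to the log, 0 if the filter
   recognised it as a duplicate, and -1 if the log is full.
   The filter is lossy: two addresses that map to the same slot evict
   each other, so an occasional duplicate entry is still possible.
   That costs space but not correctness. */
int log_address(undo_log_t *l, void *addr) {
    size_t slot = (size_t)(((uintptr_t)addr >> 3) % FILTER_SLOTS);
    if (l->filter[slot] == addr) return 0;   /* duplicate filtered */
    if (l->count == LOG_CAP) return -1;      /* no room left       */
    l->filter[slot] = addr;
    l->entries[l->count++] = addr;
    return 1;
}
```

Logging the same address repeatedly leaves `count` at 1, which is the effect the abstract describes for accesses the compiler could not prove redundant statically.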
Conference Paper
Programmers have traditionally used locks to synchronize concurrent access to shared data. Lock-based synchronization, however, has well-known pitfalls: using locks for fine-grain synchronization and composing code that already uses locks are both difficult and prone to deadlock. Transactional memory provides an alternate concurrency control mechanism that avoids these pitfalls and significantly eases concurrent programming. Transactional memory language constructs have recently been proposed as extensions to existing languages or included in new concurrent language specifications, opening the door for new compiler optimizations that target the overheads of transactional memory. This paper presents compiler and runtime optimizations for transactional memory language constructs. We present a high-performance software transactional memory system (STM) integrated into a managed runtime environment. Our system efficiently implements nested transactions that support both composition of transactions and partial roll back. Our JIT compiler is the first to optimize the overheads of STM, and we show novel techniques for enabling JIT optimizations on STM operations. We measure the performance of our optimizations on a 16-way SMP running multi-threaded transactional workloads. Our results show that these techniques enable transactional memory's performance to compete with that of well-tuned synchronization.
Article
In this paper, we identify transaction-local memory as a major source of overhead from compiler instrumentation in software transactional memory (STM). Transaction-local memory is memory allocated inside a transaction, which cannot escape (i.e., is captured by) the allocating transaction. Accesses to such memory do not require calls to STM memory access functions (i.e., STM barriers). A compiler unaware of this may translate accesses to captured memory into expensive STM barriers, which presents an opportunity to improve STM performance. Our measurements with the STAMP benchmark suite (version 0.9.9) revealed that as many as 60% of the STM barriers generated by our baseline compiler access captured memory, including 90% of the write barriers and 45% of the read barriers. We propose runtime and compiler optimizations to elide STM barriers to captured memory. These techniques can also elide barriers for accesses to thread-local and read-only data. We implemented those optimizations in the Intel C++ STM compiler. Our experiments with the STAMP benchmark suite on an Intel Dunnington system (with 24 cores in a 4-node SMP system) show that these optimizations can improve performance by up to 18% at 16 threads.
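The idea of eliding barriers for captured memory can be sketched with a runtime check. The code below is a simplified assumption, not the mechanism implemented in the Intel C++ STM compiler: the transaction descriptor `tx_t`, the functions `tx_begin`, `tx_record_alloc`, `tx_is_captured`, and `tx_write`, and the fixed-size range table are all hypothetical. The transaction records address ranges it allocated itself, and the write path skips the expensive barrier work when the target address falls inside one of them.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_CAPTURED 16   /* hypothetical limit on tracked ranges */

typedef struct { uintptr_t start, end; } range_t;  /* [start, end) */

typedef struct {
    range_t       captured[MAX_CAPTURED];
    size_t        ncaptured;
    unsigned long full_barriers;  /* writes that took the full path */
    unsigned long elided;         /* writes that skipped it         */
} tx_t;

void tx_begin(tx_t *tx) {
    *tx = (tx_t){0};
}

/* Record memory allocated inside the transaction.
   Returns 0 on success, -1 if the range table is full. */
int tx_record_alloc(tx_t *tx, void *p, size_t size) {
    if (tx->ncaptured == MAX_CAPTURED) return -1;
    tx->captured[tx->ncaptured].start = (uintptr_t)p;
    tx->captured[tx->ncaptured].end   = (uintptr_t)p + size;
    tx->ncaptured++;
    return 0;
}

/* Linear scan over the recorded ranges. */
int tx_is_captured(const tx_t *tx, const void *addr) {
    uintptr_t a = (uintptr_t)addr;
    for (size_t i = 0; i < tx->ncaptured; i++)
        if (a >= tx->captured[i].start && a < tx->captured[i].end)
            return 1;
    return 0;
}

/* Write path: captured memory is invisible to other threads until the
   transaction commits, so it needs no logging or conflict detection.
   The counters stand in for the work a real barrier would do. */
void tx_write(tx_t *tx, long *addr, long v) {
    if (tx_is_captured(tx, addr)) tx->elided++;
    else                          tx->full_barriers++;
    *addr = v;
}
```

The range check itself has a cost, which is one reason the abstract pairs the runtime technique with compiler optimizations that can prove capture statically and remove the barrier call altogether.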
Conference Paper
A shared data structure is lock-free if its operations do not require mutual exclusion. If one process is interrupted in the middle of an operation, other processes will not be prevented from operating on that object. In highly concurrent systems, lock-free data structures avoid common problems associated with conventional locking techniques, including priority inversion, convoying, and difficulty of avoiding deadlock. This paper introduces transactional memory, a new multiprocessor architecture intended to make lock-free synchronization as efficient (and easy to use) as conventional techniques based on mutual exclusion. Transactional memory allows programmers to define customized read-modify-write operations that apply to multiple, independently-chosen words of memory. It is implemented by straightforward extensions to any multiprocessor cache-coherence protocol. Simulation results show that transactional memory matches or outperforms the best known locking techniques for simple benchmarks, even in the absence of priority inversion, convoying, and deadlock.
Conference Paper
In the last four years, IBM® and Intel® have released processors with support for transactional memory. Most studies have evaluated these processors using the STAMP applications and considered only the causes of aborts for the applications as a whole. This work presents a per-transaction analysis of STAMP and contrasts different performance metrics to determine the fundamental reasons for the low performance of HaswellTM in some applications. In summary, the results show that one transaction dominates the execution time and has few commits because it exceeds the processor's limited capacity or generates many conflicts when executed in hardware.
Article
Transactional Memory (TM) is a new programming paradigm for both simple concurrent programming and high concurrent performance. Hardware Transactional Memory (HTM) is hardware support for TM-based programming. It has lower overhead than software transactional memory (STM), which is a software-based implementation of TM. There are now four commercial systems, IBM Blue Gene/Q, IBM zEnterprise EC12, Intel Core, and IBM POWER8, offering HTM. Our work is the first to compare the performance of these four HTM systems. We measured the STAMP benchmarks, the most widely used TM benchmarks. We also evaluated the specific features of each HTM system. Our experimental results show that: (1) there is no single HTM system that is more scalable than the others in all of the benchmarks, (2) there are measurable performance differences among the HTM systems in some benchmarks, and (3) each HTM system has its own implementation characteristics that limit its scalability.
Conference Paper
We present the introduction of transactional memory into the next generation IBM System z CPU. We first describe the instruction-set architecture features, including requirements for enterprise-class software RAS. We then describe the implementation in the IBM zEnterprise EC12 (zEC12) microprocessor generation, focusing on how transactional memory can be embedded into the existing cache design and multiprocessor shared-memory infrastructure. We explain practical reasons behind our choices. The zEC12 system has been available since September 2012.
Conference Paper
In this paper, we propose a new technique that can identify transaction-local memory (i.e. captured memory), in managed environments, while having a low runtime overhead. We implemented our proposal in a well known STM framework (Deuce) and we tested it in STMBench7 with two different STMs: TL2 and LSA. In both STMs the performance improved significantly (4 times and 2.6 times, respectively). Moreover, running the STAMP benchmarks with our approach shows improvements of 7 times in the best case for the Vacation application.
Conference Paper
Transactional Memory (TM) is emerging as a promising technology to simplify parallel programming. While several TM systems have been proposed in the research literature, we are still missing the tools and workloads necessary to analyze and compare the proposals. Most TM systems have been evaluated using microbenchmarks, which may not be representative of any real-world behavior, or individual applications, which do not stress a wide range of execution scenarios. We introduce the Stanford Transactional Application for Multi-Processing (STAMP), a comprehensive benchmark suite for evaluating TM systems. STAMP includes eight applications and thirty variants of input parameters and data sets in order to represent several application domains and cover a wide range of transactional execution cases (frequent or rare use of transactions, large or small transactions, high or low contention, etc.). Moreover, STAMP is portable across many types of TM systems, including hardware, software, and hybrid systems. In this paper, we provide descriptions and a detailed characterization of the applications in STAMP. We also use the suite to evaluate six different TM systems, identify their shortcomings, and motivate further research on their performance characteristics.
Book
The advent of multicore processors has renewed interest in the idea of incorporating transactions into the programming model used to write parallel programs. This approach, known as transactional memory, offers an alternative, and hopefully better, way to coordinate concurrent threads. The ACI (atomicity, consistency, isolation) properties of transactions provide a foundation to ensure that concurrent reads and writes of shared data do not produce inconsistent or incorrect results. At a higher level, a computation wrapped in a transaction executes atomically: either it completes successfully and commits its result in its entirety or it aborts. In addition, isolation ensures the transaction produces the same result as if no other transactions were executing concurrently. Although transactions are not a parallel programming panacea, they shift much of the burden of synchronizing and coordinating parallel computations from a programmer to a compiler, to a language runtime system, or to hardware. The challenge for the system implementers is to build an efficient transactional memory infrastructure. This book presents an overview of the state of the art in the design and implementation of transactional memory systems, as of early spring 2010.
Article
Software transactional memory (STM) systems are an attractive environment to evaluate optimistic concurrency. We describe our experience of supporting and optimizing an STM system at both the managed runtime and compiler levels. We describe the design policies of our STM system, and the statistics collected by the runtime to identify performance bottlenecks and guide tuning decisions. We present initial work on supporting automatic instrumentation of STM primitives for C/C++ and Java programs in the IBM XL compiler and J9 JVM. We evaluate and discuss the performance of several transactional programs running on our system.
Article
Leveraging the full power of multicore processors demands new tools and new thinking from the software industry. Concurrency has long been touted as the "next big thing" and "the way of the future," but for the past 30 years, mainstream software development has been able to ignore it. Our parallel future has finally arrived: new machines will be parallel machines, and this will require major changes in the way we develop software. The introductory article in this issue ("The Future of Microprocessors" by Kunle Olukotun and Lance Hammond) describes the hardware imperatives behind this shift in computer architecture from uniprocessors to multicore processors, also known as CMPs (chip multiprocessors). (For related analysis, see "The Free Lunch Is Over: A Fundamental Turn Toward Concurrency in Software.")
Intel (2009). Intel Transactional Memory Compiler and Runtime Application Binary Interface. Intel Corporation, 1.1 edition.
Intel (2012). Intel® Architecture Instruction Set Extensions Programming Reference. Intel Corporation.
Marathe, V. J., Spear, M. F., Heriot, C., Acharya, A., Eisenstat, D., Scherer, W. N., and Scott, M. L. (2006). Lowering the overhead of nonblocking software transactional memory. In First ACM SIGPLAN Workshop on Languages, Compilers, and Hardware Support for Transactional Computing.
Ruan, W., Liu, Y., and Spear, M. (2014a). Stamp need not be considered harmful. In Ninth ACM SIGPLAN Workshop on Transactional Computing.