Article

# ATOM: A System for Building Customized Program Analysis Tools

## Abstract

ATOM (Analysis Tools with OM) is a single framework for building a wide range of customized program analysis tools. It provides the common infrastructure present in all code-instrumenting tools; this is the difficult and time-consuming part. The user simply defines the tool-specific details in instrumentation and analysis routines. Building a basic block counting tool like Pixie with ATOM requires only a page of code. ATOM, using OM link-time technology, organizes the final executable such that the application program and user's analysis routines run in the same address space. Information is directly passed from the application program to the analysis routines through simple procedure calls instead of inter-process communication or files on disk. ATOM takes care that analysis routines do not interfere with the program's execution, and precise information about the program is presented to the analysis routines at all times. ATOM uses no simulation or interpretation. ATOM has been implemented on the Alpha AXP under OSF/1. It is efficient and has been used to build a diverse set of tools for basic block counting, profiling, dynamic memory recording, instruction and data cache simulation, pipeline simulation, evaluating branch prediction, and instruction scheduling.
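The split the abstract describes, between an instrumentation routine that decides where calls go and an analysis routine that receives them in the same address space, can be illustrated with a small simulation. This is only a sketch under stated assumptions: it does not use ATOM's actual interface or rewrite any binary. The toy program, the block names, and the helpers `instrument`, `run`, and `count_blocks` are all hypothetical, and basic blocks are modelled as Python functions that return the name of their successor.

```python
from collections import Counter


def instrument(blocks, analysis):
    """Instrumentation routine: prepend an analysis call to every basic block."""
    def wrap(block_id, body):
        def instrumented(state):
            analysis(block_id)   # plain procedure call, no IPC and no files
            return body(state)   # the original block runs unchanged
        return instrumented
    return {block_id: wrap(block_id, body) for block_id, body in blocks.items()}


def run(blocks, entry, state):
    """Run blocks from `entry` until a block returns None as its successor."""
    current = entry
    while current is not None:
        current = blocks[current](state)
    return state


# Hypothetical toy application: sum the integers 0..n-1 using four basic blocks.
def bb_entry(state):
    state["i"], state["total"] = 0, 0
    return "test"


def bb_test(state):
    return "body" if state["i"] < state["n"] else "exit"


def bb_body(state):
    state["total"] += state["i"]
    state["i"] += 1
    return "test"


def bb_exit(state):
    return None


PROGRAM = {"entry": bb_entry, "test": bb_test, "body": bb_body, "exit": bb_exit}


def count_blocks(n):
    """Run the instrumented toy program; return (result, per-block counts)."""
    counts = Counter()

    def count_block(block_id):   # analysis routine
        counts[block_id] += 1

    state = run(instrument(PROGRAM, count_block), "entry", {"n": n})
    return state["total"], counts
```

Calling `count_blocks(3)` runs the loop body three times and the loop test four times, and the program's own result is the same as it would be without instrumentation, which mirrors the abstract's requirement that analysis routines not interfere with the program's execution.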

... Binary instrumentation [196,123,134,155,134,29,145,146], on the other hand, works directly at the level of machine code. Analysing and manipulating binaries is difficult and tedious, since, all too often, one has to re-discover the obvious. ...
... Static binary instrumentation tools [123,196,173,67] are less widely used today. As mentioned earlier in this chapter there are some inherently undecidable problems in static binary analysis. ...
Thesis
Debugging, as usually understood, revolves around finding and removing defects in software that prevent it from functioning correctly. That is, when one talks about bugs and debugging one usually means functional bugs and functional debugging. In the context of this thesis, however, we will talk about performance bugs and performance debugging. By this we mean finding defects that do not cause a program to crash or behave wrongly, but that make it run inefficiently, too slowly, or use too many resources. To that end, we have developed tools that analyse and model the performance to help programmers improve their code to get better performance. We propose the following two performance debugging techniques: sensitivity based performance bottleneck analysis and data-dependence profiling driven optimization feedback. Sensitivity Based Performance Bottleneck Analysis: Answering a seemingly trivial question about a program's performance, such as whether it is memory-bound or CPU-bound, can be surprisingly difficult. This is because the CPU and memory are not merely two completely independent resources, but are composed of multiple complex interdependent subsystems. Here a stall of one resource can either mask or aggravate problems with another resource. We present a sensitivity based performance bottleneck analysis that uses a high-level performance model implemented in GUS, a fast CPU simulator, to pinpoint performance bottlenecks. Our performance model needs a baseline for the expected performance of different operations on a CPU, like the peak IPC and how different instructions compete for processor resources. Unfortunately, this information is seldom published by hardware vendors, such as Intel or AMD. To build our processor model, we have developed a system to reverse-engineer the required information using automatically generated micro-benchmarks. Data-Dependence Driven Polyhedral Optimization Feedback: We have developed MICKEY, a dynamic data-dependence profiler that provides high-level optimization feedback on the applicability and profitability of optimizations missed by the compiler. MICKEY leverages the polyhedral model, a powerful optimization framework for finding sequences of loop transformations to expose data locality and implement both coarse, i.e. thread, and fine-grain, i.e. vector-level, parallelism. Our tool uses dynamic binary instrumentation, allowing it to analyze programs written in different programming languages or using third-party libraries for which no source code is available. Internally MICKEY uses a polyhedral intermediate representation (IR) that encodes both the dynamic execution of a program's instructions and its data dependencies. The IR not only captures data dependencies across multiple loops but also across, possibly recursive, procedure calls. We have developed an efficient trace compression algorithm, called the folding algorithm, that constructs this polyhedral IR from a program's execution. The folding algorithm also finds strides in memory accesses to predict the possibility and profitability of vectorization. It can scale to real-life applications thanks to a safe, selective over-approximation mechanism for partially irregular data dependencies and iteration spaces.
... A popular alternative to compiler instrumentation is binary instrumentation (e.g., [6,15,37,44,48,51,58,59]), which works by directly modifying the binary executable of the program-under-test to incorporate instrumentation code. To handle the task of modifying a binary, tool writers generally employ a binary-instrumentation framework, such as Pin [44], DynamoRIO [6], Valgrind [48], or DynInst [4,9]. ...
... Other strategies include binary instrumentation [6,15,37,44,48,51,58,59] and asynchronous sampling [10,28,50]. Section 6 compares binary and compiler instrumentation. ...
Article
The CSI framework provides comprehensive static instrumentation that a compiler can insert into a program-under-test so that dynamic-analysis tools - memory checkers, race detectors, cache simulators, performance profilers, code-coverage analyzers, etc. - can observe and investigate runtime behavior. Heretofore, tools based on compiler instrumentation would each separately modify the compiler to insert their own instrumentation. In contrast, CSI inserts a standard collection of instrumentation hooks into the program-under-test. Each CSI-tool is implemented as a library that defines relevant hooks, and the remaining hooks are "nulled" out and elided during either compile-time or link-time optimization, resulting in instrumented runtimes on par with custom instrumentation. CSI allows many compiler-based tools to be written as simple libraries without modifying the compiler, lowering the bar for the development of dynamic-analysis tools. We have defined a standard API for CSI and modified LLVM to insert CSI hooks into the compiler's internal representation (IR) of the program. The API organizes IR objects - such as functions, basic blocks, and memory accesses - into flat and compact ID spaces, which not only simplifies the building of tools, but surprisingly enables faster maintenance of IR-object data than do traditional hash tables. CSI hooks contain a "property" parameter that allows tools to customize behavior based on static information without introducing overhead. CSI provides "forensic" tables that tools can use to associate IR objects with source-code locations and to relate IR objects to each other. To evaluate the efficacy of CSI, we implemented six demonstration CSI-tools. One of our studies shows that compiling with CSI and linking with the "null" CSI-tool produces a tool-instrumented executable that is as fast as the original uninstrumented code. 
Another study, using a CSI port of Google's ThreadSanitizer, shows that the CSI-tool rivals the performance of Google's custom compiler-based implementation. All other demonstration CSI tools slow down the execution of the program-under-test by less than 70%.
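The hook-and-ID-space design described above can be sketched in a few lines. This is a simplified model under stated assumptions, not the CSI API: the hook names (`on_init`, `on_bb_entry`, `on_load`), the `FORENSIC_TABLE` contents, and the driver `run_program_under_test` are hypothetical, and no compiler is involved. It shows a tool written as a library that overrides only the hooks it needs, per-block data kept in a flat list indexed by basic-block ID instead of a hash table, and a table that maps IDs back to source locations.

```python
class NullTool:
    """Default tool: every hook is a no-op, so it adds no analysis work."""

    def on_init(self, num_basic_blocks):
        pass

    def on_bb_entry(self, bb_id, prop):
        pass

    def on_load(self, load_id, addr, prop):
        pass


class BlockCoverageTool(NullTool):
    """A tool defines only the hooks it cares about; the rest stay null."""

    def on_init(self, num_basic_blocks):
        # Flat, compact ID space: one list slot per basic block.
        self.hits = [0] * num_basic_blocks

    def on_bb_entry(self, bb_id, prop):
        self.hits[bb_id] += 1


# Hypothetical table relating basic-block IDs to (file, line).
FORENSIC_TABLE = [("demo.c", 3), ("demo.c", 5), ("demo.c", 9)]


def run_program_under_test(tool, trace):
    """Stand-in for instrumented code: fire the entry hook for each block run."""
    tool.on_init(len(FORENSIC_TABLE))
    for bb_id in trace:
        tool.on_bb_entry(bb_id, prop=0)


def uncovered(tool):
    """Source locations of basic blocks that never executed."""
    return [FORENSIC_TABLE[i] for i, n in enumerate(tool.hits) if n == 0]
```

With the trace `[0, 1, 1, 0]`, a `BlockCoverageTool` ends up with `hits == [2, 2, 0]` and `uncovered` reports the block at `demo.c` line 9. Running the same trace with `NullTool` does nothing, which is the case the abstract says can be elided entirely at compile or link time.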
... The additional instructions are inserted before the program flow is continued. ATOM [128] was a follow-up tool to OM [5] by the same authors that relied on OM's transformation scheme, but added new code in a distinct section between the text and data sections, creating a first version of a static minimal-invasive binary transformation style. ...
... The first tool implementing this approach is ATOM [128]. Since ATOM also implements the first version of a static minimal-invasive binary transformation scheme, it can be seen as a mixed approach. ...
Article
Binary rewriting is changing the semantics of a program without having the source code at hand. It is used for diverse purposes, such as emulation (e.g., QEMU), optimization (e.g., DynInst), observation (e.g., Valgrind), and hardening (e.g., Control flow integrity enforcement). This survey gives detailed insight into the development and state-of-the-art in binary rewriting by reviewing 67 publications from 1966 to 2018. Starting from these publications, we provide an in-depth investigation of the challenges and respective solutions to accomplish binary rewriting. Based on our findings, we establish a thorough categorization of binary rewriting approaches with respect to their use-case, applied analysis technique, code-transformation method, and code generation techniques. We contribute a comprehensive mapping between binary rewriting tools, applied techniques, and their domain of application. Our findings emphasize that although much work has been done over the past decades, most of the effort was put into improvements aiming at rewriting general purpose applications but ignoring other challenges like altering throughput-oriented programs or software with real-time requirements, which are often used in the emerging field of the Internet of Things. To the best of our knowledge, our survey is the first comprehensive overview on the complete binary rewriting process.
... In this section, we compare Pemu with these platforms. Note that static binary code instrumentation or rewriting systems, including the first influential link-time instrumentation system ATOM (Srivastava and Eustace, 1994), are not within our scope. At a high level, these dynamic binary instrumentation platforms can be classified into (1) machine simulator, emulator, and virtualizer, (2) process level instrumentation framework, ...
... Wala [8] also provides a static instrumentation library for Java bytecode. Atom [9] is another static binary instrumentation framework on the Alpha processor platform for the Tru-64 OS. PEBIL [10] is a static binary instrumentation framework for the x86-64 architecture. ...
... To achieve this, the technique collects LLVM IR BB execution counts. The technique is target-agnostic in the sense that the instrumentation code is inserted at the architecture-independent LLVM IR level and does not require the modification of a program's object code to insert instrumentation instructions, unlike in other approaches [21]. The energy consumption estimations are also collected at the LLVM IR level. ...
Article
Energy transparency is a concept that makes a program's energy consumption visible from hardware up to software, through the different system layers. Such transparency can enable energy optimizations at each layer and between layers, and help both programmers and operating systems make energy-aware decisions. In this paper, we focus on deeply embedded devices, typically used for Internet of Things (IoT) applications, and demonstrate how to enable energy transparency through existing Static Resource Analysis (SRA) techniques and a new target-agnostic profiling technique, without the need of hardware energy measurements. A novel mapping technique enables software energy consumption estimations at a higher level than the Instruction Set Architecture (ISA), namely the LLVM Intermediate Representation (IR) level, and therefore introduces energy transparency directly to the LLVM optimizer. We apply our energy estimation techniques to a comprehensive set of benchmarks, including single-threaded and also multi-threaded embedded programs from two commonly used concurrency patterns, task farms and pipelines. Using SRA, our LLVM IR results demonstrate a high accuracy with a deviation in the range of 1% from the ISA SRA. Our profiling technique captures the actual energy consumption at the LLVM IR level with an average error of less than 3%.
... In this section, we compare PEMU with these platforms. Note that static binary code instrumentation or rewriting systems, including the first influential link-time instrumentation system ATOM [40], are not within our scope. At a high level, these dynamic binary instrumentation platforms can be classified into (1) machine simulator, emulator, and virtualizer, (2) process level instrumentation framework, and (3) system wide instrumentation framework. ...
Conference Paper
Over the past 20 years, we have witnessed a widespread adoption of dynamic binary instrumentation (DBI) for numerous program analyses and security applications including program debugging, profiling, reverse engineering, and malware analysis. To date, there are many DBI platforms, and the most popular one is Pin, which provides various instrumentation APIs for process instrumentation. However, Pin does not support the instrumentation of OS kernels. In addition, the execution of the instrumentation and analysis routine is always inside the virtual machine (VM). Consequently, it cannot support any out-of-VM introspection that requires strong isolation. Therefore, this paper presents PEMU, a new open source DBI framework that is compatible with Pin-APIs, but supports out-of-VM introspection for both user level processes and OS kernels. Unlike in-VM instrumentation in which there is no semantic gap, for out-of-VM introspection we have to bridge the semantic gap and provide abstractions (i.e., APIs) for programmers. One important feature of PEMU is its API compatibility with Pin. As such, many Pin plugins are able to execute atop PEMU without any source code modification. We have implemented PEMU, and our experimental results with the SPEC 2006 benchmarks show that PEMU introduces reasonable overhead.
... Therefore, researchers and practitioners have proposed various tools that can help in discovering vulnerabilities automatically. Classical vulnerability detection tools rely on static [10,3,17] or dynamic [22,20,24] code analysis, symbolic execution or taint analysis. However, with the advent of efficient machine learning techniques, new approaches appear that try to solve Software Engineering (SE) problems by training AI prediction models on large amount of annotated code samples. ...
Preprint
In the age of big data and machine learning, at a time when the techniques and methods of software development are evolving rapidly, a problem has arisen: programmers can no longer detect all the security flaws and vulnerabilities in their code manually. To overcome this problem, developers can now rely on automatic techniques, like machine learning based prediction models, to detect such issues. An inherent property of such approaches is that they work with numeric vectors (i.e., feature vectors) as inputs. Therefore, one needs to transform the source code into such feature vectors, often referred to as code embedding. A popular approach for code embedding is to adapt natural language processing techniques, like text representation, to automatically derive the necessary features from the source code. However, the suitability and comparison of different text representation techniques for solving Software Engineering (SE) problems is rarely studied systematically. In this paper, we present a comparative study on three popular text representation methods, word2vec, fastText, and BERT, applied to the SE task of detecting vulnerabilities in Python code. Using a data mining approach, we collected a large volume of Python source code in both vulnerable and fixed forms that we embedded with word2vec, fastText, and BERT to vectors and used a Long Short-Term Memory network to train on them. Using the same LSTM architecture, we could compare the efficiency of the different embeddings in deriving meaningful feature vectors. Our findings show that all the text representation methods are suitable for code representation in this particular task, but the BERT model is the most promising as it is the least time consuming and the LSTM model based on it achieved the best overall accuracy (93.8%) in predicting Python source code vulnerabilities.
... Valgrind is a popular tool that has been applied to profiling and debugging x86 programs [19]. Valgrind is a basic-block interpreter/dynamic compiler with an instrumentation interface similar to those supplied by static rewriting packages like Atom [20], EEL [15], and Etch [17]. Noninteractive (and we suppose interactive) debugging features can be implemented in Valgrind by registering the appropriate instrumentation functions. ...
Article
Breakpoints, watchpoints, and conditional variants of both are essential debugging primitives, but their natural implementations often degrade performance significantly. Slowdown arises because the debugger—the tool implementing the breakpoint/watchpoint interface—is implemented in a process separate from the debugged application. Since the debugger evaluates the watchpoint expressions and conditional predicates to determine whether to invoke the user, a debugging session typically requires many expensive application-debugger context switches, resulting in slowdowns of 40,000 times or more in current commercial and open-source debuggers! In this paper, we present an effective and efficient implementation of (conditional) breakpoints and watchpoints that uses DISE to dynamically embed debugger logic into the running application. DISE (dynamic instruction stream editing) is a previously-proposed, programmable hardware facility for dynamically customizing applications by transforming the instruction stream as it is decoded. DISE embedding preserves the logical separation of application and debugger—instructions are added dynamically and transparently, existing application code and data are not statically modified—and has little startup cost. Cycle-level simulation on the SPEC 2000 integer benchmarks shows that the DISE approach eliminates all unnecessary context switching, typically limits debugging overhead to 25% or less for a wide range of watchpoints, and outperforms alternative implementations.
... Rewriting binary code can be dated back to the late 1960s [16], when it was first used for flexible performance measurement. Later in the late 1980s and early 1990s, rewriting binaries for RISC architectures (e.g., Alpha, SPARC, and MIPS), in which code is well-aligned and instructions have a fixed length, became quite popular in applications such as instrumentation (e.g., PIXIE [12], ATOM [38]), performance measurement and optimization (e.g., QPT [23] and EEL [24]), and architecture translation [35]. In contrast to RISC architectures, rewriting binaries for CISC such as x86 is much more challenging. ...
Conference Paper
Static binary rewriting is a core technology for many systems and security applications, including profiling, optimization, and software fault isolation. While many static binary rewriters have been developed over the past few decades, most make various assumptions about the binary, such as requiring correct disassembly, cooperation from compilers, or access to debugging symbols or relocation entries. This paper presents Multiverse, a new binary rewriter that is able to rewrite Intel CISC binaries without these assumptions. Two fundamental techniques are developed to achieve this: (1) a superset disassembly that completely disassembles the binary code into a superset of instructions in which all legal instructions fall, and (2) an instruction rewriter that is able to relocate all instructions to any other location by mediating all indirect control flow transfers and redirecting them to the correct new addresses. A prototype implementation of Multiverse and evaluation on SPECint 2006 benchmarks shows that Multiverse is able to rewrite all of the testing binaries with a reasonable overhead for the new rewritten binaries. Simple static instrumentation using Multiverse and its comparison with dynamic instrumentation shows that the approach achieves better average performance. Finally, the security applications of Multiverse are exhibited by using it to implement a shadow stack.
... In a SASSI handler, the developer directs SASSI on which type of instruction to instrument, and where (either before or after the instruction) to add the instrumentation code. SASSI is modeled after the Intel Pin toolset [15], which was modeled after the Digital Equipment Corporation ATOM toolset [16]. ...
... Some instrumentation tools are also capable of inserting instrumentation points to binary executables, either statically or dynamically. QPT [19], EEL [20], and ATOM [34] are examples of static binary instrumentation tools. Static instrumentation is based on static analysis and, hence, cannot react to application changes at run time. ...
Article
Software tracing techniques are well-established and used by instrumentation tools to extract run-time information for program analysis and debugging. Dynamic binary instrumentation is one such technique: it instruments program binaries to extract information. Unfortunately, instrumentation causes perturbation that is unacceptable for time-sensitive applications. Consequently, we developed DIME*, a tool for dynamic binary instrumentation that considers timing constraints. DIME* uses Pin and a rate-based server approach to extract information only as long as user-specified constraints are maintained. Due to the large amount of redundancy in program traces, DIME* reduces the instrumentation overhead by one to three orders of magnitude compared to native Pin while extracting up to 99% of the information. We instrument VLC and PostgreSQL to demonstrate the usability of DIME*.
... Early research on binary instrumentation was conducted for non-security purposes such as optimization, performance measurement, and profiling. These techniques (e.g., PIXIE [45], ATOM [46], QPT [47], and EEL [48]) are based on RISC architectures (e.g., SPARC and MIPS). Because the ARM architecture is also a RISC, these studies are highly relevant to our study. ...
Article
Binary rewriting techniques are widely used in program vulnerability fixing, obfuscation, security-oriented transforming, and other purposes, such as binary profiling and optimization. Over the past decade, most binary instrumentation techniques have been studied on the x86 architecture, specifically focusing on the challenges of instrumenting non-PIC. In contrast, ARM architecture has received little attention, and statically instrumenting PIC has not been studied in depth. In ARM, owing to its fixed-length instructions, addresses are frequently computed via multiple stages, making it difficult to handle all relative addresses, especially the relative address of base-plus-offset and base-plus-index addressing. In this paper, we present REPICA, a static binary instrumentation technique which can rewrite ARM binaries compiled in a position-independent fashion. REPICA can instrument anywhere without symbolic information. With the aim of identifying and processing relative addresses accurately, we designed a value-set analysis specialized for PIC of which the domain is in symbolic format. We also identified a new challenge for situations in which all relative addresses cannot be corrected in an optimized way and solved this problem efficiently by the stepwise correction of each relative address. We implemented a prototype of REPICA and experimented with approximately 1200 COTS binaries and SPECint2006 benchmarks. The experiment showed that all binaries rewritten by REPICA maintain relative addresses correctly with negligible execution and space overhead. Finally, we exhibit the effectiveness of REPICA by using it to implement a shadow stack.
... Wala [10] also provides a static instrumentation library for Java bytecode. Atom [11] is another static binary instrumentation framework on the Alpha processor platform for the Tru-64 OS. PEBIL [12] is a static binary instrumentation framework for the x86-64 architecture. ...
Preprint
Most hardware-assisted solutions for software security, program monitoring, and event-checking approaches require instrumentation of the target software, an operation which can be performed using an SBI (Static Binary Instrumentation) or a DBI (Dynamic Binary Instrumentation) framework. Hardware-assisted instrumentation can use one of these two solutions to instrument data to a memory-mapped register. Both these approaches require an in-depth knowledge of frameworks and a significant amount of software modification in order to instrument a whole application. This work proposes a novel way to instrument an application with minor modifications, at the source code level, taking advantage of underlying hardware debug components such as CS (CoreSight) components available on Xilinx Zynq SoCs. As an example, the instrumentation approach proposed in this work is used to detect a double free security attack. Furthermore, it is evaluated in terms of runtime and area overhead. Results show that the proposed solution takes 30 $\mu$s on average to instrument an instruction while the optimized version only takes 0.014 $\mu$s, which is ten times better than usual memory-mapped register solutions used in existing works.
... This approach is commonly known as Static Binary Instrumentation (SBI), and sometimes referred to as binary rewriting. While techniques for specific tasks such as collecting instruction traces had already been described three decades ago [16], it is only with the ATOM [55] framework that a general SBI-based approach to tool building was proposed. ATOM provided selective instrumentation capabilities, with information being passed from the application program to analysis routines via procedure calls. ...
Conference Paper
Dynamic binary instrumentation (DBI) techniques allow for monitoring and possibly altering the execution of a running program up to the instruction level granularity. The ease of use and flexibility of DBI primitives has made them popular in a large body of research in different domains, including software security. Lately, the suitability of DBI for security has been questioned in light of transparency concerns from artifacts that popular frameworks introduce in the execution: while they do not perturb benign programs, a dedicated adversary may detect their presence and defeat the analysis. The contributions we provide are two-fold. We first present the abstraction and inner workings of DBI frameworks, how DBI assisted prominent security research works, and alternative solutions. We then dive into the DBI evasion and escape problems, discussing attack surfaces, transparency concerns, and possible mitigations. We make available to the community a library of detection patterns and stopgap measures that could be of interest to DBI users.
... Ratanaworabhan et al. [84] also used basic blocks to classify single-threaded application phases. The authors used ATOM [102], an application analysis tool, to identify basic blocks and assign each basic block a unique identification number. During phase classification, the authors identified a phase change when a significant number of new basic blocks executed in a short period of time. ...
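The detection rule summarized in this excerpt, flagging a phase change when many previously unseen basic blocks execute within a short period, can be sketched as follows. The excerpt gives no concrete parameters, so the fixed-size interval, the `threshold` on new blocks, and the function name `detect_phase_changes` are assumptions made for illustration and are not taken from the cited work.

```python
def detect_phase_changes(trace, interval, threshold):
    """Return indices of intervals in which at least `threshold`
    previously unseen basic-block IDs executed.

    `trace` is a sequence of basic-block IDs in execution order and
    `interval` is the number of executed blocks per interval (assumed fixed).
    """
    seen = set()
    changes = []
    for start in range(0, len(trace), interval):
        window = trace[start:start + interval]
        new_blocks = set(window) - seen      # blocks never executed before
        if len(new_blocks) >= threshold:
            changes.append(start // interval)
        seen |= new_blocks
    return changes
```

For the trace `[1, 2, 1, 2, 1, 2, 1, 2, 7, 8, 9, 7, 8, 9, 7, 8]` with `interval=4` and `threshold=2`, the function reports intervals 0 and 2: the first interval introduces blocks 1 and 2 (the start of the first phase), and the third introduces blocks 7, 8, and 9. A real classifier would need the unique block IDs that the excerpt says were assigned with ATOM; here they are simply given as integers.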
Preprint
Adaptable computing is an increasingly important paradigm that specializes system resources to variable application requirements, environmental conditions, or user requirements. Adapting computing resources to variable application requirements (or application phases) is otherwise known as phase-based optimization. Phase-based optimization takes advantage of application phases, or execution intervals of an application, that behave similarly, to enable effective and beneficial adaptability. In order for phase-based optimization to be effective, the phases must first be classified to determine when application phases begin and end, and ensure that system resources are accurately specialized. In this paper, we present a survey of phase classification techniques that have been proposed to exploit the advantages of adaptable computing through phase-based optimization. We focus on recent techniques and classify these techniques with respect to several factors in order to highlight their similarities and differences. We divide the techniques by their major defining characteristics---online/offline and serial/parallel. In addition, we discuss other characteristics such as prediction and detection techniques, the characteristics used for prediction, interval type, etc. We also identify gaps in the state-of-the-art and discuss future research directions to enable and fully exploit the benefits of adaptable computing.
... ASTlog [8] is a tool for simple pattern matching in parse trees. Tools such as ATOM [9] provide support for binary level translation. Purify [10] performs binary level translation of generic programs for inserting bound checking. ...
Article
In this paper, we discuss possible uses of static analysis to facilitate runtime checking. In particular, we focus on two categories of uses: static analysis for helping with runtime bounds checking and the general case of using static analysis for helping with code instrumentation.
Conference Paper
Caches are configured at design time so as to produce a low average memory access time across a wide range of applications. This may result in poor performance, because no single cache architecture can match the innate requirements of every application across that range. In addition, an application follows a unique sequence of patterns during its execution. These patterns, called phases, repeat themselves periodically. Some phases are stable and extend for billions of instructions. These stable phases constitute the majority of the application's execution. In this paper, a novel reconfigurable cache architecture is designed which dynamically discerns these phases and configures the caches based on the throughput. For the reconfiguration of the caches, we propose changing the associativity while keeping the size constant. This change in associativity is made for the first level cache, and we conclude that a decrease in associativity at the first level cache does not result in any increase in the cache misses at the second level cache. For this, we delineate the behavior of the level one cache misses which occur because of the change in associativity and confine these misses well within the second level cache. As a consequence, we have derived a simple equation which determines the threshold value for the cache reconfiguration decision. This threshold value is contingent on the actual running time of the application and hence does not need to be trained for each application separately.
Article
Generating an accurate estimate of the performance of a program on a given system is important to a large number of people. Computer architects, compiler writers, and developers all need insight into a machine's performance. There are a number of performance estimation techniques in use, from profile-based approaches to full machine simulation. This paper discusses a profile-based performance estimation technique that uses a lightweight instrumentation phase whose running time is on the order of the number of dynamic instructions, followed by an analysis phase whose running time is roughly on the order of the number of static instructions. This technique accurately predicts the performance of the core pipeline of a detailed out-of-order issue processor model while scheduling far fewer instructions than does full simulation. The difference between the predicted execution time and the time obtained from full simulation is only a few percent.
Conference Paper
With the growing processor-memory performance gap, understanding and optimizing a program's reference locality, and consequently, its cache performance, is becoming increasingly important. Unfortunately, current reference locality optimizations rely on heuristics and are fairly ad-hoc. In addition, while optimization technology for improving instruction cache performance is fairly mature (though heuristic-based), data cache optimizations are still at an early stage. We believe the primary reason for this imbalance is the lack of a suitable representation of a program's dynamic data reference behavior and a quantitative basis for understanding this behavior. We address these issues by proposing a quantitative basis for understanding and optimizing reference locality, and by describing efficient data reference representations and an exploitable locality abstraction that support this framework. Our data reference representations (Whole Program Streams and Stream Flow Graphs) are compact - two to four orders of magnitude smaller than the program's data reference trace - and permit efficient analysis - on the order of seconds to a few minutes - even for complex applications. These representations can be used to efficiently compute our exploitable locality abstraction (hot data streams). We demonstrate that these representations and our hot data stream abstraction are useful for quantifying and exploiting data reference locality. We applied our framework to several SPECint 2000 benchmarks, a graphics program, and a commercial Microsoft database application. The results suggest significant opportunity for hot data stream-based locality optimizations.
Conference Paper
With the advent of technology, multi-core architectures are prevalent in embedded, general-purpose as well as high-performance computing. Efficient utilization of these platforms in an architecture agnostic way is an extremely challenging task. Hence, profiling tools are essential for programmers to optimize the applications for these architectures and understand the bottlenecks. Typical bottlenecks are irregular memory-access patterns and data-communication among cores which may reduce anticipated performance improvement. In this study, we first survey the memory-access optimization profilers. Thereafter, we provide a detailed comparison of data-communication profilers and highlight their strong and weak aspects. Finally, recommendations for improving existing data-communication profilers and/or designing future ones are thoroughly discussed.
Chapter
With this chapter we begin our discussion of the functional correctness solutions that can be pursued past the release of a new microprocessor design, when the device is already shipped and installed in an end-customer’s system. Error detection at this stage of the processor’s life cycle entails monitoring its behavior dynamically, by observing the internal state of the device with dedicated hardware components residing on the silicon die. In addition to error detection, runtime validation solutions, also called in-the-field solutions, must include an effective recovery and error bypass algorithm, to ensure minimal performance loss and forward progress of the system even in presence of bugs. To make the case for dynamic validation in this chapter, we first discuss the type, criticality and number of escaped design errors reported in several processor products. We then overview two major classes of runtime solutions: checker-based approaches and patching-based ones. The main difference between these two techniques lies in the underlying error detection mechanism. Checker-based solutions focus on verifying high-level system invariants, usually specified at design-time and then mapped to dedicated hardware components. Patching techniques address bugs of which the manufacturer becomes aware after product release and provide programmable means to describe these bugs so that the system can later identify their occurrence at runtime. We then contrast these two frameworks in terms of error coverage, usage flow and performance overhead, and present in detail some of the most popular academic and industrial solutions known today for each of the two classes.
Conference Paper
Knowing the provenance of a data item helps in ascertaining its trustworthiness. Various approaches have been proposed to track or infer data provenance. However, these approaches either treat an executing program as a black-box, limiting the fidelity of the captured provenance, or require developers to modify the program to make it provenance-aware. In this paper, we introduce DataTracker, a new approach to capturing data provenance based on taint tracking, a technique widely used in the security and reverse engineering fields. Our system is able to identify data provenance relations through dynamic instrumentation of unmodified binaries, without requiring access to, or knowledge of, their source code. Hence, we can track provenance for a variety of well-known applications. Because DataTracker looks inside the executing program, it captures high-fidelity and accurate data provenance.
Chapter
Many data layout techniques for cache optimization reduce data cache miss rates significantly while only marginally improving run time. This chapter suggests a systematic approach to array merging, a simple but highly effective optimization with a beneficial effect on the memory hierarchy. The run time trade-off can be kept small while improving on cache and particularly on misses in the translation look-aside buffer (TLB). One of the SPEC95 benchmarks is analyzed in detail, with encouraging experimental results.
Chapter
In this paper, we describe the implementation of memory checking functionality that is based on instrumentation tools. The combination of instrumentation-based checking functions and the MPI implementation offers superior debugging functionality for errors that cannot otherwise be detected with comparable MPI debugging tools. Our implementation contains three parts: first, a memory callback extension that is implemented on top of the Valgrind Memcheck tool for advanced memory checking in parallel applications; second, a new instrumentation tool, developed on the Intel Pin framework, which provides functionality similar to Memcheck and can be used in Windows environments that have no access to the Valgrind suite; third, the integration of all the checking functionality as the so-called memchecker framework within Open MPI. This will also allow other memory debuggers that offer a similar API to be integrated. The tight control of the user’s memory passed to Open MPI allows us to detect application errors and to track bugs within Open MPI itself. The extension of the callback mechanism targets communication buffer checks in both pre- and post-communication phases, in order to analyze the usage of the received data, e.g. whether the received data has been overwritten before it is used in a computation or whether the data is never used. We describe our actual checks, how memory buffers are being handled internally, show errors actually found in user’s code, and the performance improvement of our instrumentation.
Chapter
There are different means to measure the computational complexity of algorithms. For fast motion estimation algorithms, most of the complexity analysis results presented in the literature are based on the average number of search points per macro-block. However, with this simple method of using the number of search points, the computational and the memory bandwidth requirements of the entire algorithm (which includes e.g. pel addressing, pel access, decision calculations, filtering, etc.) are not taken into account.
Chapter
Researchers have successfully implemented machine learning classifiers to predict bugs in a change file for years. Change classification focuses on determining if a new software change is clean or buggy. In the literature, several bug prediction methods at change level have been proposed to improve software reliability. This paper proposes a classification-based bug prediction model. Four supervised machine learning classifiers (Support Vector Machine, Decision Tree, Random Forest, and Naive Bayes) are applied to predict the bugs in software changes, and the performance of these four classifiers is characterized. We considered a public dataset and downloaded the corresponding source code and its metrics. Thereafter, we produced new software metrics by analyzing source code at class level and unified these metrics with the existing set. We obtained a new dataset to which we applied the machine learning algorithms, and compared the bug prediction accuracy of the newly defined metrics. Results showed that our merged dataset is practical for bug prediction based experiments.
Article
With the rapid development of electronic and information technology, Internet of Things (IoT) devices have become extensively utilised in various fields. Increasing attention has been paid to the performance and security analysis of IoT-based services. Dynamic instrumentation is a common process in software analysis for acquiring runtime information. However, due to the limited software and hardware resources in IoT devices, most dynamic instrumentation tools do not support IoT-based services. In this paper, we provide an analysis tool, IoTDIT, to solve the current problem of runtime detection in IoT-based services. IoTDIT employs static analysis and ptrace system calls to obtain dynamic firmware information, which can aid in firmware performance analysis and security detection. We perform experiments to verify the performance and effectiveness of the proposed instrumentation tool.
Conference Paper
Fuzz testing techniques are becoming pervasive for their ever-improving ability to generate crashing trial cases for programs. Memory safety violations however can lead to silent corruptions and errors, and a fuzzer may recognize them only in the presence of sanitization machinery. For closed-source software combining sanitization with fuzzing incurs practical obstacles that we try to tackle with an architecture-independent proposal called QASan for detecting heap memory violations. In our tests QASan is competitive with standalone sanitizers and adds a moderate 1.61x average slowdown to the AFL++ fuzzer while enabling it to reveal more heap-related bugs.
Chapter
With the rapid development of electronic and information technology, IoT devices have become widely used in various fields. Increasing attention has been paid to the performance and security analysis of IoT devices. Dynamic instrumentation is a common process in software analysis for acquiring runtime information. However, due to the limited software and hardware resources in IoT devices, most dynamic instrumentation tools do not support IoT devices. In this paper, we provide an analysis tool, IoTDIT, to solve the current problem of runtime detection in IoT devices. IoTDIT uses static analysis and ptrace system calls to obtain dynamic firmware information, which can aid in firmware performance analysis and security detection. We perform experiments to verify the performance and effectiveness of the proposed instrumentation tool.
Conference Paper
Binary instrumentation frameworks are widely used to implement profilers, performance evaluation, error checking, and bug detection tools. While dynamic binary instrumentation tools such as PIN and DynamoRio are supported on CPUs, GPU architectures currently only have limited support for similar capabilities through static compile-time tools, which prohibits instrumentation of dynamically loaded libraries that are foundations for modern high-performance applications. This work presents NVBit, a fast, dynamic, and portable, binary instrumentation framework, that allows users to write instrumentation tools in CUDA/C/C++ and selectively apply that functionality to pre-compiled binaries and libraries executing on NVIDIA GPUs. Using dynamic recompilation at the SASS level, NVBit analyzes GPU kernel register requirements to generate efficient ABI compliant instrumented code without requiring the tool developer to have detailed knowledge of the underlying GPU architecture. NVBit allows basic-block instrumentation, multiple function injections to the same location, inspection of all ISA visible state, dynamic selection of instrumented or uninstrumented code, permanent modification of register state, source code correlation, and instruction removal. NVBit supports all recent NVIDIA GPU architecture families including Kepler, Maxwell, Pascal and Volta and works on any pre-compiled CUDA, OpenACC, OpenCL, or CUDA-Fortran application.
Article
Adaptable computing is an increasingly important paradigm that specializes system resources to variable application requirements, environmental conditions, or user requirements. Adapting computing resources to variable application requirements (or application phases) is otherwise known as phase-based optimization. Phase-based optimization takes advantage of application phases, or execution intervals of an application that behave similarly, to enable effective and beneficial adaptability. In order for phase-based optimization to be effective, the phases must first be classified to determine when application phases begin and end, and ensure that system resources are accurately specialized. In this paper, we present a survey of phase classification techniques that have been proposed to exploit the advantages of adaptable computing through phase-based optimization. We focus on recent techniques and classify these techniques with respect to several factors in order to highlight their similarities and differences. We divide the techniques by their major defining characteristics: online/offline and serial/parallel. In addition, we discuss other characteristics such as prediction and detection techniques, the characteristics used for prediction, interval type, etc. We also identify gaps in the state-of-the-art and discuss future research directions to enable and fully exploit the benefits of adaptable computing.
Conference Paper
Programmers depend on virtual platforms, such as Android-QEMU, to build and test their applications as well as system software, which eases development. It is easy to add tracing modules to the virtual platforms to dump the execution trace of the guest system, which can then be used to estimate and evaluate the performance of alternative system designs. However, tracing in virtual platforms is performed at the architecture level, which makes it difficult to generate traces for specific applications due to the lack of high-level software information. This problem becomes even more challenging for Android systems, which use component-based design strategies, where applications request services from other components. In this paper, we propose a novel Android-QEMU tracing system, which follows the invocations among the service components and generates component-service-aware traces only for specific applications. The evaluation results show that the proposed system improves 152% in simulation time and saves 33% of storage space on average compared to the static QEMU-Tracer.
Article
Memory error debugging is one of the most critical processes in improving software quality. However, because debugging is so time-consuming, it often becomes a major bottleneck in the development process of mobile devices. Most of the existing memory error detection tools are based on static error detection; however, these tools cannot be used on mobile devices because of their large working memory. Therefore, it is challenging for mobile device vendors to deliver high quality mobile devices to the market in time. In this paper, we introduce "PinMemcheck", a Pin-based memory error detection tool, which detects all potential memory errors within 1.5× execution time overhead compared with that of a baseline configuration by applying Pin's binary instrumentation process and a simple data structure.
Article
Function call interception (FCI), or method call interception (MCI) in the object-oriented programming domain, is a technique of intercepting function calls at program runtime. Without directly modifying the original code, FCI makes it possible to undertake certain operations before and/or after the called function or even to replace the intercepted call. Thanks to this capability, FCI has been typically used to profile programs, where functions of interest are dynamically intercepted by instrumentation code so that the execution control is transferred to an external module that performs execution time measurement or logging operations. In addition, FCI allows for manipulating the runtime behavior of program components at the fine-grained function level, which can be useful in changing an application's original behavior at runtime to meet certain execution requirements such as maintaining performance characteristics for different input problem sets. Due to this capability, however, some FCI techniques can be used as a basis of many security exploits for vulnerable systems. In this paper, we survey a variety of FCI techniques and tools along with their applications to diverse areas in the computing and software domains. We describe static and dynamic FCI techniques at large and discuss the strengths and weaknesses of different implementations in this category. In addition, we also discuss aspect-oriented programming implementation techniques for intercepting method calls.
Chapter
The computing industry is changing rapidly, pushing strongly to consolidation into large cloud computing datacenters. New power, availability, and cost constraints require installations that are better optimized for their intended use. The problem of right-sizing large datacenters requires tools that can characterize both the target workloads and the hardware architecture space. Together with the resurgence of variety in industry standard CPUs, driven by very ambitious multi-core roadmaps, this is making the existing modeling techniques obsolete. In this chapter we revisit the basic computer architecture simulation concepts toward enabling fast and reliable datacenter simulation. Speed, full system, and modularity are the fundamental characteristics of a datacenter-level simulator. Dynamically trading off speed/accuracy, running an unmodified software stack, and leveraging existing component simulators are some of the key aspects that should drive next generation simulator's design. As a case study, we introduce the COTSon simulation infrastructure, a scalable full-system simulator developed by HP Labs and AMD, targeting fast and accurate evaluation of current and future computing systems.
Conference Paper
PROTEUS is a high-performance simulator for MIMD multiprocessors. It is fast, accurate, and flexible: it is one to two orders of magnitude faster than comparable simulators, it can reproduce results from real multiprocessors, and it is easily configured to simulate a wide range of architectures...
Conference Paper
In trace-driven simulation, traces generated for one set of system characteristics are used to simulate a system with different characteristics. However, the execution path of a multiprocessor workload may depend on the order of events occurring on different processing elements. The event order, in turn, depends on system characteristics such as memory-system latencies and buffer sizes. Trace-driven simulations of multiprocessor workloads are inaccurate unless the dependencies are eliminated from the traces. We have measured the effects of these inaccuracies by comparing trace-driven simulations to direct simulations of the same workloads. The simulators predicted identical performance only for workloads whose traces were timing-independent. Workloads that used first-come first-served scheduling and/or non-deterministic algorithms produced timing-dependent traces, and simulation of these traces produced inaccurate performance predictions. Two types of performance metrics were particularly affected: those related to synchronization latency and those derived from relatively small numbers of events. To accurately predict such performance metrics, timing-independent traces or direct simulation should be used.
Conference Paper
While much current research concerns multiprocessor design, few traces of parallel programs are available for analyzing the effect of design trade-offs. Existing trace collection methods have serious drawbacks: trap-driven methods often slow down program execution by more than 1000 times, significantly perturbing program behavior; microcode modification is faster, but the technique is neither general nor portable. This paper describes a new tool, called MPTRACE, for collecting traces of multithreaded parallel programs executing on shared-memory multiprocessors. MPTRACE requires no hardware or microcode modification; it collects complete program traces; it is portable; and it reduces execution-time dilation to less than a factor of 3. MPTRACE is based on inline tracing, in which a program is automatically modified to produce trace information as it executes. We show how the use of compiler flow analysis techniques can reduce the amount of data collected and therefore the runtime dilation of the traced program. We also discuss problematic issues concerning buffering and writing of trace data on a multiprocessor.
Conference Paper
Simulation and tracing tools help in the analysis, design, and tuning of both hardware and software systems. Simulators can execute code for hardware that does not yet exist, can provide access to internal state that may be invisible on real hardware, can give deterministic execution in the face of races, and can produce “stress test” situations that are hard to produce on the real hardware [4]. Tracing tools can provide detailed information about the behavior of a program; that information is used to drive an analyzer that analyzes or predicts the behavior of a particular system component. That, in turn, provides feedback that is used to improve the design and implementation of everything from architectures to compilers to applications. Analyzers consume many kinds of trace information; for example, address traces are used for studies of memory hierarchies, opcode and operand usage for superscalar and pipelined processor design, instruction counts for optimization studies, operand values for memoizing studies, and branch behavior for branch prediction.
Conference Paper
Code instrumentation is a powerful mechanism for understanding program behavior. Unfortunately, code instrumentation is extremely difficult, and therefore has been mostly relegated to building special purpose tools for use on standard industry benchmark suites. ATOM (Analysis Tools with OM) provides a very flexible and efficient code instrumentation interface that allows powerful, high performance program analysis tools to be built with very little effort. This paper illustrates this flexibility by building five complete tools that span the interests of application programmers, computer architects, and compiler writers. This flexibility does not come at the expense of performance. Because ATOM uses procedure calls as the interface between the application and the analysis routines, the performance of each tool is similar to or greatly exceeds the best known hand-crafted implementations.
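The procedure-call interface between application and analysis routines can be pictured with a toy basic-block counter. This is a minimal sketch under stated assumptions, not ATOM's actual API: the program is modeled as a dictionary of Python functions standing in for basic blocks, and the names `analysis_count_block`, `instrument`, and `run` are hypothetical.

```python
from collections import Counter

# State owned by the analysis side; it lives in the same process as the "application".
block_counts = Counter()


def analysis_count_block(block_id):
    """Analysis routine (hypothetical): reached by a plain call each time a block runs."""
    block_counts[block_id] += 1


def instrument(blocks):
    """Return a copy of the program with an analysis call placed before every block."""
    def wrap(block_id, body):
        def instrumented(state):
            analysis_count_block(block_id)  # direct procedure call, no IPC or files
            return body(state)
        return instrumented
    return {block_id: wrap(block_id, body) for block_id, body in blocks.items()}


def run(blocks, entry, state):
    """Execute blocks until one returns None as its successor."""
    current = entry
    while current is not None:
        current = blocks[current](state)
    return state


# Toy application: sum the integers 0..n-1. Each block returns the next block's name.
def entry_block(state):
    state["i"] = 0
    state["total"] = 0
    return "loop_test"


def loop_test(state):
    return "loop_body" if state["i"] < state["n"] else None


def loop_body(state):
    state["total"] += state["i"]
    state["i"] += 1
    return "loop_test"


program = {"entry": entry_block, "loop_test": loop_test, "loop_body": loop_body}
final_state = run(instrument(program), "entry", {"n": 3})
```

The application's result is unchanged by the instrumentation, and `block_counts` afterwards holds one execution count per block. A real tool would rewrite machine code rather than wrap functions, which is the hard part the framework is described as handling.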
Conference Paper
Trace-driven simulation is often used in the design of computer systems, especially caches and translation lookaside buffers. Capturing address traces to drive such simulations has been problematic, often involving 1000:1 software overhead to trace a target workload, and/or mechanisms that cause significant distortions in the recorded data. A new technique for capturing address traces has been developed to use a processor's microcode to record addresses in a reserved part of main memory as a side effect of normal execution. An experimental implementation of this technique on a VAX 8200 processor shows a number of advantages over previous techniques, including fewer distortions of the address trace and a hundred times faster recording. With this technique, it is possible to gather full operating-system traces of multi-tasking workloads.
Conference Paper
EEL (Executable Editing Library) is a library for building tools to analyze and modify an executable (compiled) program. The systems and languages communities have built many tools for error detection, fault isolation, architecture translation, performance measurement, simulation, and optimization using this approach of modifying executables. Currently, however, tools of this sort are difficult and time-consuming to write and are usually closely tied to a particular machine and operating system. EEL supports a machine- and system-independent editing model that enables tool builders to modify an executable without being aware of the details of the underlying architecture or operating system or being concerned with the consequences of deleting instructions or adding foreign code.
Unreachable procedures are procedures that can never be invoked. Their existence may adversely affect the performance of a program. Unfortunately, their detection requires the entire program to be present. Using a link-time code modification system, we analyze large linked program modules of C++, C and Fortran. We find that C++ programs using object-oriented programming style contain a large fraction of unreachable procedure code. In contrast, C and Fortran programs have a low and essentially constant fraction of unreachable code. In this paper, we present our analysis of C++, C and Fortran programs, and discuss how object-oriented programming style generates unreachable procedures. This paper will appear in the ACM LOPLAS Vol 1, #4. It replaces Technical Note TN-21, an earlier version of the same material. 1 Introduction Unreachable procedures unnecessarily bloat an executable, making it require more disk space and decreasing its locality, which may affect its cache and paging be...
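Detecting unreachable procedures amounts to a reachability search over the whole-program call graph. The sketch below is illustrative only and is not the link-time system described above: the call graph is given as a plain dictionary, the function name `unreachable_procedures` and the example procedure names are hypothetical, and it assumes every possible call edge (including indirect calls) is already in the graph.

```python
from collections import deque


def unreachable_procedures(call_graph, roots):
    """Return procedures that no chain of calls from the roots can reach.

    call_graph maps a procedure name to the names it may call.
    Assumption: indirect and virtual call targets are already listed as edges,
    or are passed in as additional roots.
    """
    reachable = set(roots)
    worklist = deque(roots)
    while worklist:
        caller = worklist.popleft()
        for callee in call_graph.get(caller, ()):
            if callee not in reachable:
                reachable.add(callee)
                worklist.append(callee)
    # Collect every procedure mentioned anywhere in the graph.
    all_procedures = set(call_graph)
    for callees in call_graph.values():
        all_procedures.update(callees)
    return all_procedures - reachable


# Hypothetical program: the two Shape methods are linked in but never called.
example_graph = {
    "main": ["parse", "run"],
    "parse": ["read_token"],
    "run": [],
    "Shape::draw": ["Shape::clip"],
    "Shape::clip": [],
}
dead = unreachable_procedures(example_graph, ["main"])
```

Here `dead` contains the two `Shape` methods. The search needs the entire program at once, which is why link time is a natural place to run it.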
Article
Compilers for new machines with 64-bit addresses must generate code that works when the memory used by the program is large. Procedures and global variables are accessed indirectly via global address tables, and calling conventions include code to establish the addressability of the appropriate tables. In the common case of a program that does not require a lot of memory, all of this can be simplified considerably, with a corresponding reduction in program size and execution time. We have used our link-time code modification system OM to perform program transformations related to global address use on the Alpha AXP. Though simple, many of these are whole-program optimizations that can be done only when we can see the entire program at once, so link-time is an ideal occasion to perform them. This paper describes the optimizations performed and shows their effects on program size and performance. Relatively modest transformations, possible without moving code, improve the performance of SPEC benchmarks by an average of 1.5%. More ambitious transformations, requiring an understanding of program structure that is thorough but not difficult at link-time, can do even better, reducing program size by 10% or more, and improving performance by an average of 3.8%. Even a program compiled monolithically with interprocedural optimization can benefit nearly as much from this technique, if it contains statically-linked pre-compiled library code. When the benchmark sources were compiled in this way, we were still able to improve their performance by 1.35% with the modest transformations and 3.4% with the ambitious transformations.
Article
Modifying code after the compiler has generated it can be useful for both optimization and instrumentation. Several years ago we designed the Mahler system, which uses link-time code modification for a variety of tools on our experimental Titan workstations. Killian's Pixie tool works even later, translating a fully-linked MIPS executable file into a new version with instrumentation added. Recently we wanted to develop a hybrid of the two, that would let us experiment with both optimization and instrumentation on a standard workstation, preferably without requiring us to modify the normal compilers and linker. This paper describes prototypes of two hybrid systems, closely related to Mahler and Pixie. We implemented basic-block counting in both, and compare the resulting time and space expansion to those of Mahler and Pixie. This paper will appear in the proceedings of the CODE 91 Workshop on Code Generation, to be published as part of the Springer series of Workshops in Co...
Article
Inserting instrumentation code in a program is an effective technique for detecting, recording, and measuring many aspects of a program's performance. Instrumentation code can be added at any stage of the compilation process by specially-modified system tools such as a compiler or linker or by new tools from a measurement system. For several reasons, adding instrumentation code after the compilation process—by rewriting the executable file—presents fewer complications and leads to more complete measurements. This paper describes the difficulties in adding code to executable files that arose in developing the profiling and tracing tools qp and qpt. The techniques used by these tools to instrument programs on MIPS and SPARC processors are applicable in other instrumentation systems running on many processors and operating systems. In addition, many difficulties could have been avoided with minor changes to compilers and executable file formats. These changes would simplify this approach to measuring program performance and make it more generally useful.
Article
We have developed a system called OM to explore the problem of code optimization at link-time. OM takes a collection of object modules constituting the entire program, and converts the object code into a symbolic Register Transfer Language (RTL) form that can be easily manipulated. This RTL is then transformed by intermodule optimization and finally converted back into object form. Although much high-level information about the program is gone at link-time, this approach enables us to perform optimizations that a compiler looking at a single module cannot see. Since object modules are more or less independent of the particular source language or compiler, this also gives us the chance to improve the code in ways that some compilers might simply have missed. To test the concept, we have used OM to build an optimizer that does interprocedural code motion. It moves simple loop-invariant code out of loops, even when the loop body extends across many procedures and the loop control is in a ...
Some Efficient Architectures Simulation Techniques. Winter 1990 USENIX Conference
• Robert Bedichek
Robert Bedichek. Some Efficient Architectures Simulation Techniques. Winter 1990 USENIX Conference, January 1990.
PROTEUS: A High-Performance Parallel-Architecture Simulator; Shade: A Fast Instruction-Set Simulator for Execution Profiling
• Eric A Brewer
• Chrysanthos N Dellarocas