Conference Paper

Valgrind: A framework for heavyweight dynamic binary instrumentation

Authors: Nicholas Nethercote and Julian Seward

Abstract

Dynamic binary instrumentation (DBI) frameworks make it easy to build dynamic binary analysis (DBA) tools such as checkers and profilers. Much of the focus on DBI frameworks has been on performance; little attention has been paid to their capabilities. As a result, we believe the potential of DBI has not been fully exploited. In this paper we describe Valgrind, a DBI framework designed for building heavyweight DBA tools. We focus on its unique support for shadow values, a powerful but previously little-studied and difficult-to-implement DBA technique, which requires a tool to shadow every register and memory value with another value that describes it. This support accounts for several crucial design features that distinguish Valgrind from other DBI frameworks. Because of these features, lightweight tools built with Valgrind run comparatively slowly, but Valgrind can be used to build more interesting, heavyweight tools that are difficult or impossible to build with other DBI frameworks such as Pin and DynamoRIO.
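To make the shadow-values idea concrete, below is a minimal illustrative sketch (not Valgrind's actual implementation) of a byte-granularity shadow memory that records whether each application byte has been written, the kind of metadata a Memcheck-style tool keeps for every register and memory value. All names (shadow_byte, mark_defined, check_defined) are invented for this example.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Sketch of a two-level shadow memory: every application byte is shadowed by
 * one metadata byte (1 = defined, 0 = undefined).  Valgrind's real shadow
 * memory is more compact and lazier; this only illustrates the idea. */
#define CHUNK_BITS 16
#define CHUNK_SIZE (1u << CHUNK_BITS)
#define NUM_CHUNKS (1u << (32 - CHUNK_BITS))   /* 32-bit address space for brevity */

static uint8_t *shadow_table[NUM_CHUNKS];      /* second-level chunks, allocated on demand */

static uint8_t *shadow_byte(uint32_t addr) {
    uint32_t chunk = addr >> CHUNK_BITS;
    if (!shadow_table[chunk])
        shadow_table[chunk] = calloc(CHUNK_SIZE, 1);   /* all bytes start undefined */
    return &shadow_table[chunk][addr & (CHUNK_SIZE - 1)];
}

/* Hooks a DBI tool would emit around every store and load. */
static void mark_defined(uint32_t addr, size_t len) {
    for (size_t i = 0; i < len; i++) *shadow_byte(addr + (uint32_t)i) = 1;
}

static void check_defined(uint32_t addr, size_t len) {
    for (size_t i = 0; i < len; i++)
        if (*shadow_byte(addr + (uint32_t)i) == 0)
            fprintf(stderr, "use of undefined byte at 0x%08x\n", addr + (uint32_t)i);
}

int main(void) {
    mark_defined(0x1000, 4);     /* simulate a 4-byte store */
    check_defined(0x1000, 8);    /* simulate an 8-byte load: the last 4 bytes are flagged */
    return 0;
}
```

A real DBI tool emits calls like these from the translated code itself and stores the shadow state far more compactly, but the data-structure shape is similar.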


... Alibaba clusters currently accommodate more than ten thousand applications, which are built upon 10+ programming languages and various development frameworks. The scale and diversity make it challenging to apply well-crafted but customized instrumentation methods for intra-service tracing observability [66,72]. Although one can instrument the shared underlying operating systems to obtain kernel-level execution traces with eBPF tools [1, 41], the more vital user-level execution traces are still black boxes to be observed. ...
... The second category consists of chronological instrumentation methods, which insert tracepoints into the program [10,66,72]. Schemes like DynamoRIO [10], Valgrind [72], and LLVM [61] assist in implementing dynamic analyses on different abstraction layers (assembly [68] or library [88]). ...
... The second category consists of chronological instrumentation methods, which insert tracepoints into the program [10,66,72]. Schemes like DynamoRIO [10], Valgrind [72], and LLVM [61] assist in implementing dynamic analyses on different abstraction layers (assembly [68] or library [88]). Compared to the five instrumentation-based methods in Table 5 (three Linux-native and two academic tools), EXIST can easily capture user-level instruction-granularity traces continuously with no intrusion, while providing satisfactory intra-service tracing functionality with low overhead. ...
... They are popular among programmers because of their precision (for many analyses, they report no false positives) and can pinpoint the exact location of errors, down to the individual line of code. Perhaps the most prominent and widely used dynamic analysis tool for C/C++ binaries is Valgrind [28]. Valgrind's most popular use case, via its default tool, Memcheck, can find a wide range of memory errors, including buffer overflows, use-after-free errors, and memory leaks. ...
... Dynamic Instrumentation: Numerous error detection tools use dynamic instrumentation, including many commercial tools. Valgrind's Memcheck tool, Dr. Memory, Purify, Intel Inspector, and Sun Discover all fall into this category [8,14,18,28,32]. These tools use dynamic instrumentation engines, such as Pin, Valgrind, and DynamoRIO [7,24,28]. ...
... Valgrind's Memcheck tool, Dr. Memory, Purify, Intel Inspector, and Sun Discover all fall into this category [8,14,18,28,32]. These tools use dynamic instrumentation engines, such as Pin, Valgrind, and DynamoRIO [7,24,28]. These tools can detect memory leaks, use-after-free errors, uninitialized reads, and buffer overflows. ...
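As a concrete illustration of the errors these tools catch, the hypothetical C program below (not taken from any of the cited papers) contains a heap buffer overflow, a use-after-free, and a leak; running it under Valgrind's Memcheck reports all three without recompiling the program.

```c
/* memcheck-demo.c -- build: cc -g memcheck-demo.c -o memcheck-demo
 * run:                valgrind --tool=memcheck --leak-check=full ./memcheck-demo */
#include <stdlib.h>
#include <string.h>

int main(void) {
    char *buf = malloc(8);
    memset(buf, 'A', 9);        /* heap buffer overflow: writes one byte past the end */

    free(buf);
    buf[0] = 'B';               /* use-after-free: write to freed memory */

    malloc(32);                 /* memory leak: pointer is discarded, never freed */
    return 0;
}
```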
Preprint
This paper presents evidence-based dynamic analysis, an approach that enables lightweight analyses (under 5% overhead for these bugs), making it practical for the first time to perform these analyses in deployed settings. The key insight of evidence-based dynamic analysis is that for a class of errors, it is possible to ensure that evidence that they happened at some point in the past remains for later detection. Evidence-based dynamic analysis allows execution to proceed at nearly full speed until the end of an epoch (e.g., a heavyweight system call). It then examines program state to check for evidence that an error occurred at some time during that epoch. If so, it rolls back execution and re-executes the code with instrumentation activated to pinpoint the error. We present DoubleTake, a prototype evidence-based dynamic analysis framework. DoubleTake is practical and easy to deploy, requiring neither custom hardware, compiler, nor operating system support. We demonstrate DoubleTake's generality and efficiency by building dynamic analyses that find buffer overflows, memory use-after-free errors, and memory leaks. Our evaluation shows that DoubleTake is efficient, imposing just 4% overhead on average, making it the fastest such system to date. It is also precise: DoubleTake pinpoints the location of these errors to the exact line and memory addresses where they occur, providing valuable debugging information to programmers.
... Emulation. Emulators and VMs such as QEMU [5] and Valgrind [25] track instrumentation points separately from the application code, requiring no rewriting or re-organization for better developer ergonomics. Instrumentation is invoked outside of the emulated state space, which prevents altering behavior. ...
... QEMU [5] is a widely-used CPU emulator that virtualizes a user-space process while supporting non-intrusive instrumentation. Valgrind [25], primarily used as a memory debugger, is similar. They include JIT compilers that are not necessarily "purpose-built" for instrumentation, but for cross-compilation, leading to an inherent overhead over virtual machines. ...
Preprint
Full-text available
Debugging and monitoring programs are integral to engineering and deploying software. Dynamic analyses monitor applications through source code or IR injection, machine code or bytecode rewriting, and virtual machine or direct hardware support. While these techniques are viable within their respective domains, common tooling across techniques is rare, leading to fragmentation of skills, duplicated efforts, and inconsistent feature support. We address this problem in the WebAssembly ecosystem with Whamm, a declarative instrumentation DSL for WebAssembly that abstracts above the instrumentation strategy, leveraging bytecode rewriting and engine support as available. Whamm solves three problems: 1) tooling fragmentation, 2) prohibitive instrumentation overhead of general-purpose frameworks, and 3) tedium of tailoring low-level high-performance mechanisms. Whamm provides fully-programmable instrumentation with declarative match rules, static and dynamic predication, automatic state reporting, and user library support, while achieving high performance through compiler and engine optimizations. At the back end, Whamm provides instrumentation to a Wasm engine as Wasm code, reusing existing engine optimizations and unlocking new ones, most notably intrinsification, to minimize overhead. In particular, explicitly requesting program state in match rules, rather than reflection, enables the engine to efficiently bundle arguments and even inline compiled probe logic. Whamm streamlines the tooling effort, as its bytecode-rewriting target can run instrumented programs everywhere, lowering fragmentation and advancing the state of the art for engine support. We evaluate Whamm with case studies of non-trivial monitors and show it is expressive, powerful, and efficient.
... For comparison the relative timings for TOPCOM are given in Table 2. All these numbers were determined with Valgrind's tool callgrind [19]. In both cases the bulk of the time is spent on finding and processing the flips. ...
... Figure 5 shows how the total amount of memory consumed depends on the cache sizes. The measurements have been taken with Valgrind's tool massif, which records memory snapshots at fixed time intervals [19]. The left hand side corresponds to the computation in Figure 4 for ∆2 × ∆6. ...
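For readers unfamiliar with the two Valgrind tools mentioned above, the sketch below shows a typical workflow on a toy C program. The program and file names are invented for illustration; the valgrind, callgrind_annotate, and ms_print invocations in the comments are the standard ones.

```c
/* profile-demo.c -- a toy hot spot to profile.
 *
 * Instruction-level profile (callgrind):
 *   cc -g -O2 profile-demo.c -o profile-demo
 *   valgrind --tool=callgrind ./profile-demo
 *   callgrind_annotate callgrind.out.<pid>   # per-function inclusive/self costs
 *
 * Heap profile over time (massif):
 *   valgrind --tool=massif ./profile-demo
 *   ms_print massif.out.<pid>                # heap-usage snapshots over time
 */
#include <stdlib.h>

static double work(int n) {
    double *v = malloc(n * sizeof *v);       /* heap allocation that massif records */
    double s = 0.0;
    for (int i = 0; i < n; i++) {
        v[i] = (double)i / (i + 1);
        s += v[i];                           /* loop that callgrind attributes cost to */
    }
    free(v);
    return s;
}

int main(void) {
    double s = 0.0;
    for (int k = 0; k < 200; k++)
        s += work(1 << 16);
    return s > 0 ? 0 : 1;
}
```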
Preprint
We report on the implementation of an algorithm for computing the set of all regular triangulations of finitely many points in Euclidean space. This algorithm, which we call down-flip reverse search, can be restricted, e.g., to computing full triangulations only; this case is particularly relevant for tropical geometry. Most importantly, down-flip reverse search allows for massive parallelization, i.e., it scales well even for many cores. Our implementation makes it possible to compute the triangulations of much larger point sets than before.
... Applications such as PolyOMP [4], OmpVerify [35], and DRACO [36] are static methods of identifying runtime errors in the OpenMP API, whereas Helgrind [36,37], Valgrind [38,39], Intel® Thread Checker [40,41], Sun Studio® Thread Analyzer [40,42], ROMP [43], and RTED [32] are dynamic methods. On the other hand, AddressSanitizer (ASan) and ThreadSanitizer (TSan) [44,45] use OpenMP's directives to identify runtime issues related to architecture. ...
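To illustrate the class of defects these dynamic tools target, here is a small hypothetical C/pthreads program with a textbook data race on a shared counter; Helgrind (valgrind --tool=helgrind) flags the unsynchronized accesses, and TSan does the same when the program is built with -fsanitize=thread.

```c
/* race-demo.c -- build: cc -g -pthread race-demo.c -o race-demo
 * run:             valgrind --tool=helgrind ./race-demo */
#include <pthread.h>
#include <stdio.h>

static long counter = 0;                 /* shared, unprotected state */

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < 100000; i++)
        counter++;                       /* unsynchronized read-modify-write: data race */
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("counter = %ld\n", counter);  /* result varies from run to run */
    return 0;
}
```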
Article
Full-text available
The growing adoption of supercomputers across various scientific disciplines, particularly by researchers without a background in computer science, has intensified the demand for parallel applications. These applications are typically developed using a combination of programming models within languages such as C, C++, and Fortran. However, modern multi-core processors and accelerators necessitate fine-grained control to achieve effective parallelism, complicating the development process. To address this, developers commonly utilize high-level programming models such as Open Multi-Processing (OpenMP), Open Accelerators (OpenACCs), Message Passing Interface (MPI), and Compute Unified Device Architecture (CUDA). These models may be used independently or combined into dual- or tri-model applications to leverage their complementary strengths. However, integrating multiple models introduces subtle and difficult-to-detect runtime errors such as data races, deadlocks, and livelocks that often elude conventional compilers. This complexity is exacerbated in applications that simultaneously incorporate MPI, OpenMP, and CUDA, where the origin of runtime errors, whether from individual models, user logic, or their interactions, becomes ambiguous. Moreover, existing tools are inadequate for detecting such errors in tri-model applications, leaving a critical gap in development support. To address this gap, the present study introduces a static analysis tool designed specifically for tri-model applications combining MPI, OpenMP, and CUDA in C++-based environments. The tool analyzes source code to identify both actual and potential runtime errors prior to execution. Central to this approach is the introduction of error dependency graphs, a novel mechanism for systematically representing and analyzing error correlations in hybrid applications. By offering both error classification and comprehensive static detection, the proposed tool enhances error visibility and reduces manual testing effort. This contributes significantly to the development of more robust parallel applications for high-performance computing (HPC) and future exascale systems.
... We profile the execution of the state-of-the-art software implementation of the SBA [24] using Valgrind [29] to characterize the application and identify the main bottlenecks. To the best of our knowledge, the literature does not provide this type of evaluation for the SBA, which can help deploy high-performance decoders. ...
Article
Quantum computing is a new paradigm of computation that exploits principles from quantum mechanics to achieve an exponential speedup compared to classical logic. However, noise strongly limits current quantum hardware, reducing achievable performance and limiting the scaling of the applications. For this reason, current noisy intermediate-scale quantum devices require Quantum Error Correction (QEC) mechanisms to identify errors occurring in the computation and correct them in real time. Nevertheless, the high computational complexity of QEC algorithms is incompatible with the tight time constraints of quantum devices. Thus, hardware acceleration is paramount to achieving real-time QEC. This work presents QASBA, an FPGA-based hardware accelerator for the Sparse Blossom Algorithm (SBA), a state-of-the-art decoding algorithm. After profiling the state-of-the-art software counterpart, we developed a design methodology for hardware development based on the SBA. We also devised an automation process to help users without expertise in hardware design in deploying architectures based on QASBA. We implement QASBA on different FPGA architectures and experimentally evaluate resource usage, execution time, and energy efficiency of our solution. Our solution attains up to 25.05× speedup and 304.16× improvement in energy efficiency compared to the software baseline.
... If a kernel has no callees, then the include time equals the self time. Since this is often the case, we only report the include time. The profiling results are obtained by the Callgrind tool [11] (a module of Valgrind [12]); the column no-MGRID shows the timings without the multigrid preconditioner. The most time-consuming kernel is the sparse matrix-vector multiplication (ax), which calls local_grad3, local_grad3_t, add2 and mxm (the call graph is shown in Figure 2). ...
Preprint
Full-text available
Mixed-precision computing has the potential to significantly reduce the cost of exascale computations, but determining when and how to implement it in programs can be challenging. In this article, we propose a methodology for enabling mixed-precision with the help of computer arithmetic tools, roofline model, and computer arithmetic techniques. As case studies, we consider Nekbone, a mini-application for the Computational Fluid Dynamics (CFD) solver Nek5000, and a modern Neko CFD application. With the help of the VerifiCarlo tool and computer arithmetic techniques, we introduce a strategy to address stagnation issues in the preconditioned Conjugate Gradient method in Nekbone and apply these insights to implement a mixed-precision version of Neko. We evaluate the derived mixed-precision versions of these codes by combining metrics in three dimensions: accuracy, time-to-solution, and energy-to-solution. Notably, mixed-precision in Nekbone reduces time-to-solution by roughly 38% and energy-to-solution by 2.8x on MareNostrum 5, while in the real-world Neko application the gain is up to 29% in time and up to 24% in energy, without sacrificing the accuracy.
... Programs can be observed using DBM. Pin [17] and Valgrind [19] work as JIT compilers to add instrumentation code around the original program instructions. GDB [29] dynamically adds int3 instructions to interrupt the program, balancing execution speed against code analysis. ...
Preprint
Full-text available
Context: Just-in-Time (JIT) compilers are able to specialize the code they generate according to a continuous profiling of the running programs. This gives them an advantage when compared to Ahead-of-Time (AoT) compilers that must choose the code to generate once and for all. Inquiry: Is it possible to improve the performance of AoT compilers by adding Dynamic Binary Modification (DBM) to the executions? Approach: We added to the Hopc AoT JavaScript compiler a new optimization based on DBM to the inline cache (IC), a classical optimization dynamic languages use to implement object property accesses efficiently. Knowledge: Reducing the number of memory accesses, as the new optimization does, does not shorten execution times on contemporary architectures. Grounding: The DBM optimization we have implemented is fully operational on x86_64 architectures. We have conducted several experiments to evaluate its impact on performance and to study the reasons for the lack of acceleration. Importance: The (negative) result we present in this paper sheds new light on the best strategy to be used to implement dynamic languages. It tells us that the old days, when removing instructions or removing memory reads always yielded a speedup, are over. Nowadays, implementing sophisticated compiler optimizations is only worth the effort if the processor is not able by itself to accelerate the code. This result applies to AoT compilers as well as JIT compilers.
... Static analysis checks for errors without executing code and can be applied to an entire code base before dynamic tests are executed. This is useful because runtime tools, such as Valgrind [71], only test the code that is executed and thus can miss corner cases that are not tested [32]. Dynamic testing comprises unit tests, integration tests and system tests. ...
Preprint
Scientific machine learning (SciML) models are transforming many scientific disciplines. However, the development of good modeling practices to increase the trustworthiness of SciML has lagged behind its application, limiting its potential impact. The goal of this paper is to start a discussion on establishing consensus-based good practices for predictive SciML. We identify key challenges in applying existing computational science and engineering guidelines, such as verification and validation protocols, and provide recommendations to address these challenges. Our discussion focuses on predictive SciML, which uses machine learning models to learn, improve, and accelerate numerical simulations of physical systems. While centered on predictive applications, our 16 recommendations aim to help researchers conduct …
... Dynamic analysis, in contrast, gathers insights about the Program Under Test (PUT)'s behavior during execution. Instrumentation for dynamic analysis can be introduced at source level (e.g., ASAN [63] or AFL++ [24]), binary level (e.g., RetroWrite [20], QEMU [11] or Valgrind [52]), or through hardware tracing features (e.g., Intel Processor Trace [37] or Arm CoreSight [8]). Dynamic binary analysis potentially only covers the actually executed subset of all the code present in the PUT. ...
... It not only detects data races but also evaluates possible repercussions to classify them depending on severity. Helgrind [105] is a versatile race detection tool within the Valgrind framework, employing a hybrid approach that combines both dynamic and static analysis methods for C/C++ programs. It effectively identifies race conditions by utilizing lockset-based analysis to uncover potential data races. ...
... Dynamically linking with the application allows us to intercept libc calls before they reach the kernel, keeping the I/O path through the kernel simple while the complexity resides in userspace. LD_PRELOAD is used by projects ranging from custom allocators to debugging tools [17,37]. Table 2 presents the design space of shim techniques; we chose the one with the best performance, and while its limitations could leave certain application ports out, the principles we discuss could be realized with other techniques or even a custom libc. ...
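A minimal sketch of the LD_PRELOAD technique referred to here (illustrative only, not Reshim's code): a shared object that interposes on the libc write() call via dlsym(RTLD_NEXT, ...), so I/O calls can be observed or redirected entirely in userspace before they reach the kernel.

```c
/* shim.c -- build: cc -shared -fPIC shim.c -o libshim.so -ldl
 * use:             LD_PRELOAD=./libshim.so ./some_application
 * A real shim would interpose many more calls (open, pwrite, fsync, ...). */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdio.h>
#include <unistd.h>

typedef ssize_t (*write_fn)(int, const void *, size_t);

ssize_t write(int fd, const void *buf, size_t count) {
    static write_fn real_write = NULL;
    static __thread int in_hook = 0;
    if (!real_write)
        real_write = (write_fn)dlsym(RTLD_NEXT, "write");  /* next definition, i.e. libc's */

    if (!in_hook && fd != 2) {           /* guard against recursing via our own stderr logging */
        in_hook = 1;
        fprintf(stderr, "[shim] write(fd=%d, %zu bytes)\n", fd, count);
        in_hook = 0;
    }
    /* Placement or accounting logic would go here; we just forward the call. */
    return real_write(fd, buf, count);
}
```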
Preprint
Full-text available
The increasing demand for SSDs coupled with scaling difficulties have left manufacturers scrambling for newer SSD interfaces which promise better performance and durability. While these interfaces reduce the rigidity of traditional abstractions, they require application or system-level changes that can impact the stability, security, and portability of systems. To make matters worse, such changes are rendered futile with the introduction of next-generation interfaces. Further, there is little guidance on data placement and hardware specifics are often abstracted from the application layer. It is no surprise therefore that such interfaces have seen limited adoption, leaving behind a graveyard of experimental interfaces ranging from open-channel SSDs to zoned namespaces. In this paper, we show how shim layers can shield systems from changing hardware interfaces while benefiting from them. We present Reshim, an all-userspace shim layer that performs affinity- and lifetime-based data placement with no change to the operating system or the application. We demonstrate Reshim's ease of adoption with host-device coordination for three widely-used data-intensive systems: RocksDB, MongoDB, and CacheLib. With Reshim, these systems see 2-6 times higher write throughput, up to 6 times lower latency, and reduced write amplification compared to filesystems like F2FS. Reshim performs on par with application-specific backends like ZenFS while offering more generality, lower latency, and richer data placement. With Reshim we demonstrate the value of isolating the complexity of the placement logic, allowing easy deployment of dynamic placement rules across several applications and storage interfaces.
... To analyze an application, execution traces are required, collected by means of dynamic binary instrumentation [6], [7] and covering most of the application's usage scenarios. Ultimately, a mapping between the code and the identified scenarios must be obtained. ...
Article
Full-text available
Profiling is the collection of characteristics about a program's execution that can be used during optimization. One example of its use is the static binary optimizer BOLT. Its main performance gain comes from code re-layout based on a profile collected while the application runs. However, this optimization is implemented without accounting for switching between application scenarios, i.e., situations in which the program characteristics being collected change during execution. As a result, when the application runs on a scenario different from the profiled one, the performance gain is smaller, or a regression is observed. Merging all possible program profiles into one does not solve all of these problems. This paper describes a multi-profile code analysis algorithm that takes changes of the program's operating scenarios into account. The algorithm is based on processing application execution traces collected with dynamic binary instrumentation. Based on this analysis, code placement can be optimized according to the scenario a piece of code belongs to, which improves code locality. Multi-profile analysis also enables an optimization that duplicates code belonging to several scenarios at once.
... However, this approach requires source code. The main problem is the need to adapt the project's build infrastructure for profiling; moreover, third-party libraries are not provided in instrumented form, which makes it harder to obtain a complete picture of the execution [2]. Instead of instrumenting code at compile time, frameworks have been developed that perform instrumentation while the application is running. ...
Article
Full-text available
Static binary optimization is one way to speed up an executable without its source code. This technique is used by the binary optimizer BOLT (Binary Optimization and Layout Tool), which requires a profile with information about taken and mispredicted branches; on the x86 architecture this information can be obtained from the LBR (Last Branch Record) hardware queue. BOLT needs it to rearrange the code layout, which reduces the number of misses in the instruction cache and the instruction translation lookaside buffer. This optimization sped up server applications running on x86 architectures by 8.0%. Applying it on the ARM architecture is limited by the frequent inability to obtain profile information from hardware facilities. This article describes methods and tools developed to obtain profile information from an application's execution trace. Trace collection is implemented with dynamic binary instrumentation. The article describes an algorithm for recovering profile information using a branch-predictor model, as well as the changes made to BOLT for extended ARM support. As a result, the target performance gains were achieved on synthetic tests and benchmark suites.
... Typically, they utilize either dynamic or static instrumentation to instrument every memory access. Dynamic instrumentation tools do not require the recompilation of source code [18,32,39,58,59], but may incur performance overhead as high as an order of magnitude. Static instrumentation tools may employ static analysis to help reduce the volume of instrumentation [3,28,31,57,61,68]. ...
Preprint
Reproducing executions of multithreaded programs is very challenging due to many intrinsic and external non-deterministic factors. Existing RnR systems achieve significant progress in terms of performance overhead, but none targets the in-situ setting, in which replay occurs within the same process as the recording process. Also, most existing work cannot achieve identical replay, which may prevent the reproduction of some errors. This paper presents iReplayer, which aims to identically replay multithreaded programs in the original process (under the "in-situ" setting). The novel in-situ and identical replay of iReplayer makes it more likely to reproduce errors, and allows it to directly employ debugging mechanisms (e.g. watchpoints) to aid failure diagnosis. Currently, iReplayer only incurs 3% performance overhead on average, which allows it to be always enabled in the production environment. iReplayer enables a range of possibilities, and this paper presents three examples: two automatic tools for detecting buffer overflows and use-after-free bugs, and one interactive debugging tool that is integrated with GDB.
... Some tools have reached high prominence and active use in the debugging of memory errors. These include Valgrind [30], which offers a suite of six tools for debugging, profiling, and error detection, including the detection of memory errors. AddressSanitizer [41] provides something of a gold standard in run-time memory error detection, but is unsuitable for run-time production use due to performance overhead. ...
Preprint
The security of billions of devices worldwide depends on the security and robustness of the mainline Linux kernel. However, the increasing number of kernel-specific vulnerabilities, especially memory safety vulnerabilities, shows that the kernel is a popular and practically exploitable target. Two major causes of memory safety vulnerabilities are reference counter overflows (temporal memory errors) and lack of pointer bounds checking (spatial memory errors). To succeed in practice, security mechanisms for critical systems like the Linux kernel must also consider performance and deployability as critical design objectives. We present and systematically analyze two such mechanisms for improving memory safety in the Linux kernel: (a) an overflow-resistant reference counter data structure designed to accommodate typical reference counter usage in kernel source code, and (b) runtime pointer bounds checking using Intel MPX in the kernel.
... The first important work is the Valgrind multiplatform dynamic binary instrumentation framework. Valgrind loads the target into memory and instruments it (Nethercote and Seward, 2007) in a manner that is similar to our approach. Among Valgrind's users we mention the DAIKON invariant detection system (Perkins and Ernst, 2004) as well as the TaintCheck system (Newsome, 2005). ...
Preprint
The present paper introduces the open-source Java Event Tracer (JETracer) framework for real-time tracing of GUI events within applications based on the AWT, Swing or SWT graphical toolkits. Our framework provides a common event model for supported toolkits, the possibility of receiving GUI events in real-time, good performance in the case of complex target applications and the possibility of deployment over a network. The present paper provides the rationale for JETracer, presents related research and details its technical implementation. An empirical evaluation where JETracer is used to trace GUI events within five popular, open-source applications is also presented.
... The core functionality of VaST is implemented in C. Valgrind (Nethercote and Seward 2007) tools are used for profiling (Nethercote et al. 2006) and detecting memory errors and leaks (Seward and Nethercote 2005). Parallel processing is implemented using OpenMP. ...
Preprint
Variability Search Toolkit (VaST) is a software package designed to find variable objects in a series of sky images. It can be run from a script or interactively using its graphical interface. VaST relies on source list matching as opposed to image subtraction. SExtractor is used to generate source lists and perform aperture or PSF-fitting photometry (with PSFEx). Variability indices that characterize scatter and smoothness of a lightcurve are computed for all objects. Candidate variables are identified as objects having high variability index values compared to other objects of similar brightness. The two distinguishing features of VaST are its ability to perform accurate aperture photometry of images obtained with non-linear detectors and handle complex image distortions. The software has been successfully applied to images obtained with telescopes ranging from 0.08 to 2.5m in diameter equipped with a variety of detectors including CCD, CMOS, MIC and photographic plates. About 1800 variable stars have been discovered with VaST. It is used as a transient detection engine in the New Milky Way (NMW) nova patrol. The code is written in C and can be easily compiled on the majority of UNIX-like systems. VaST is free software available at http://scan.sai.msu.ru/vast/
... Since our approach leads to a runtime verification tool, this can be compared to other such existing tools. With the exception of valgrind [8], most tools in this category focus on Java programs. For instance, Java PathExplorer [4] executes annotated Java byte code, along with a monitor which can check various properties, including past-time LTL. ...
Preprint
We describe a novel approach for adapting an existing software model checker to perform precise runtime verification. The software under test is allowed to communicate with the wider environment (including the file system and network). The modifications to the model checker are small and self-contained, making this a viable strategy for re-using existing model checking tools in a new context. Additionally, from the data that is gathered during a single execution in the runtime verification mode, we automatically re-construct a description of the execution environment which can then be used in the standard, full-blown model checker. This additional verification step can further improve coverage, especially in the case of parallel programs, without introducing substantial overhead into the process of runtime verification.
... Our primary motivation is the possibility of using synthesized grammars with grammar-based fuzzers [23,28,38]. For example, such inputs can be used to find bugs in real-world programs [24,39,48,67], learn abstractions [41], predict performance [30], and aid dynamic analysis [42]. Beyond fuzzing, a grammar synthesis algorithm could be used to reverse engineer input formats [29], in particular, network protocol message formats can help security analysts discover vulnerabilities in network programs [8,35,36,66]. ...
Preprint
We present an algorithm for synthesizing a context-free grammar encoding the language of valid program inputs from a set of input examples and blackbox access to the program. Our algorithm addresses shortcomings of existing grammar inference algorithms, which both severely overgeneralize and are prohibitively slow. Our implementation, GLADE, leverages the grammar synthesized by our algorithm to fuzz test programs with structured inputs. We show that GLADE substantially increases the incremental coverage on valid inputs compared to two baseline fuzzers.
... This makes stringent unit testing particularly important. We have found running unit tests with Valgrind/Memcheck [20] to be indispensable for detecting memory leaks, uses of uninitialized variables, out-of-bounds array accesses, and other similar mistakes. ...
Preprint
Arb is a C library for arbitrary-precision interval arithmetic using the midpoint-radius representation, also known as ball arithmetic. It supports real and complex numbers, polynomials, power series, matrices, and evaluation of many special functions. The core number types are designed for versatility and speed in a range of scenarios, allowing performance that is competitive with non-interval arbitrary-precision types such as MPFR and MPC floating-point numbers. We discuss the low-level number representation, strategies for precision and error bounds, and the implementation of efficient polynomial arithmetic with interval coefficients.
... Other tools for performance analysis such as Valgrind (Nethercote and Seward, 2007) collect very detailed performance information. This comes at the expense of execution time: running with Valgrind, the processing time slows by up to two orders of magnitude. ...
Preprint
Modern astronomical data processing requires complex software pipelines to process ever growing datasets. For radio astronomy, these pipelines have become so large that they need to be distributed across a computational cluster. This makes it difficult to monitor the performance of each pipeline step. To gain insight into the performance of each step, a performance monitoring utility needs to be integrated with the pipeline execution. In this work we have developed such a utility and integrated it with the calibration pipeline of the Low Frequency Array, LOFAR, a leading radio telescope. We tested the tool by running the pipeline on several different compute platforms and collected the performance data. Based on this data, we make well informed recommendations on future hardware and software upgrades. The aim of these upgrades is to accelerate the slowest processing steps for this LOFAR pipeline. The pipeline collector suite is open source and will be incorporated in future LOFAR pipelines to create a performance database for all LOFAR processing.
... We opted to use two different backends to better understand the strengths of each approach. These are detailed further in Section 4. angr utilizes VEX IR, which was originally created for Valgrind [40], a dynamic program instrumentation tool. Fie embeds the KLEE symbolic execution engine [16], which uses LLVM IR, originally developed by the LLVM project as a compilation target. ...
Preprint
The USB protocol has become ubiquitous, supporting devices from high-powered computing devices to small embedded devices and control systems. USB's greatest feature, its openness and expandability, is also its weakness, and attacks such as BadUSB exploit the unconstrained functionality afforded to these devices as a vector for compromise. Fundamentally, it is virtually impossible to know whether a USB device is benign or malicious. This work introduces FirmUSB, a USB-specific firmware analysis framework that uses domain knowledge of the USB protocol to examine firmware images and determine the activity that they can produce. Embedded USB devices use microcontrollers that have not been well studied by the binary analysis community, and our work demonstrates how lifters into popular intermediate representations for analysis can be built, as well as the challenges of doing so. We develop targeting algorithms and use domain knowledge to speed up these processes by a factor of 7 compared to unconstrained fully symbolic execution. We also successfully find malicious activity in embedded 8051 firmwares without the use of source code. Finally, we provide insights into the challenges of symbolic analysis on embedded architectures and provide guidance on improving tools to better handle this important class of devices.
... Several approaches have used hardware performance counters to identify hardware-level performance bottlenecks [8,12,32]. Techniques based on binary instrumentation can identify cache and heap performance issues, contended locks, and other program hotspots [5,31,36]. ParaShares and Harmony identify basic blocks that run during periods with little or no parallelism [25,26]. Code identified by these tools is a good candidate for parallelization or classic serial optimizations. ...
Preprint
Improving performance is a central concern for software developers. To locate optimization opportunities, developers rely on software profilers. However, these profilers only report where programs spent their time: optimizing that code may have no impact on performance. Past profilers thus both waste developer time and make it difficult for them to uncover significant optimization opportunities. This paper introduces causal profiling. Unlike past profiling approaches, causal profiling indicates exactly where programmers should focus their optimization efforts, and quantifies their potential impact. Causal profiling works by running performance experiments during program execution. Each experiment calculates the impact of any potential optimization by virtually speeding up code: inserting pauses that slow down all other code running concurrently. The key insight is that this slowdown has the same relative effect as running that line faster, thus "virtually" speeding it up. We present Coz, a causal profiler, which we evaluate on a range of highly-tuned applications: Memcached, SQLite, and the PARSEC benchmark suite. Coz identifies previously unknown optimization opportunities that are both significant and targeted. Guided by Coz, we improve the performance of Memcached by 9%, SQLite by 25%, and accelerate six PARSEC applications by as much as 68%; in most cases, these optimizations involve modifying under 10 lines of code.
... Binary instrumentation. Dynamic program analysis tools based on system emulators [2,18] and just-in-time instrumentation [3,26,28] inevitably suffer from a high runtime overhead. Static reassembleable disassemblers [10,12,45] instrument binaries at the assembly-code level by producing reassembleable disassembly; LLVM rewriters [46] further lift the binaries to LLVM IR, which allows the application of LLVM compiler passes. ...
Preprint
Full-text available
Speculative execution is crucial in enhancing modern processor performance but can introduce Spectre-type vulnerabilities that may leak sensitive information. Detecting Spectre gadgets from programs has been a research focus to enhance the analysis and understanding of Spectre attacks. However, one of the problems of existing approaches is that they rely on the presence of source code (or are impractical in terms of run-time performance and gadget detection ability). This paper presents Teapot, the first Spectre gadget scanner that works on COTS binaries with comparable performance to compiler-based alternatives. As its core principle, we introduce Speculation Shadows, a novel approach that separates the binary code for normal execution and speculation simulation in order to improve run-time efficiency. Teapot is based on static binary rewriting. It instruments the program to simulate the effects of speculative execution and also adds integrity checks to detect Spectre gadgets at run time. By leveraging fuzzing, Teapot succeeds in efficiently detecting Spectre gadgets. Evaluations show that Teapot outperforms a previously proposed binary-based approach in both performance (more than 20x faster) and gadget detection ability.
... To evaluate the effectiveness of our design, we implemented an in-house trace-based simulator, including a simulated CPU design and a mini Linux-based kernel. The front end of our trace-based simulator adopts the dynamic binary instrumentation tool Valgrind [12] to capture the virtual addresses accessed by each benchmark or application. Our simulated CPU simulates a 16-way set associative 8MB Last-Level Cache (LLC), where half of the LLC will be configured as the pre-execute cache for both Sync_Runahead and ITS. ...
Article
Android malware authors often use packers to evade analysis. Although many unpacking tools have been proposed, they face two significant challenges: 1) They are easily impeded by anti-analysis techniques employed by packers, preventing efficient collection of hidden Dex data. 2) They are typically designed to unpack a specific packer and cannot handle malware packed with mixed packers. Consequently, many packed malware samples evade detection. To bridge this gap, we propose Gupacker, a novel generalized unpacking framework. Gupacker offers a generic solution for first-generation holistic packers by customizing the Android system source code. It identifies the type of packer and selects an appropriate unpacking function, constructs a deeper active call chain to achieve generic unpacking of second-generation function-extraction packers, and uses JNI function and instruction monitoring to handle third-generation virtual obfuscation packers. On this basis, we counteract a diverse array of anti-analysis techniques. We conduct extensive experiments on 5K packed Android malware samples, comparing Gupacker with 2 commercial and 4 state-of-the-art academic unpacking tools. The results demonstrate that Gupacker significantly improves the efficiency of Android malware unpacking with acceptable system overhead. We analyze real packed applications based on Gupacker and found that several had been re-packed by attackers, including WPS for Android, with tens of millions of users. We received and responsibly reported 13 zero-day vulnerabilities and also assisted in the remediation of all of them.
Preprint
Manually detecting and exploiting vulnerabilities is time-consuming and error-prone, so researchers have proposed a variety of vulnerability detection and exploitation schemes, among which automated exploitation methods have received broad attention in recent years. Automated exploitation can quickly discover and exploit vulnerabilities and, by generating complex attack paths and strategies in bulk, launch large-scale, customized attacks against specific targets. Existing work lacks a systematic classification and discussion of automated exploitation techniques. The main contributions of this paper are: (1) dividing the development of automated exploitation techniques into three stages (single-task adversarial, multi-task adversarial, and large-model agents) and discussing the limitations of existing datasets; (2) dividing the automated exploitation process into two steps, vulnerability detection and payload generation: to find the root cause that triggers a vulnerability, the discussion of vulnerability detection focuses on vulnerability identification and localization techniques, while, to generate payloads or proof-of-concept exploits with strong defense-evasion capability, the discussion of payload generation focuses on exploitation-primitive generation and bypassing defense mechanisms; (3) discussing the limitations of large language model agents when handling real-world vulnerabilities, including detection of unknown vulnerabilities, the defense-evasion capability of payloads, and code reliability; (4) discussing the application prospects of automated exploitation techniques in CTF competitions and penetration testing.
Article
Algorithmic differentiation (AD) is a set of techniques that provide partial derivatives of computer-implemented functions. Such functions can be supplied to state-of-the-art AD tools via their source code, or via intermediate representations produced while compiling their source code. We present the novel AD tool Derivgrind, which augments the machine code of compiled programs with forward-mode AD logic. Derivgrind leverages the Valgrind instrumentation framework for structured access to the machine code, and a shadow memory tool to store dot values. Access to the source code is required at most for the files in which input and output variables are defined. Derivgrind's versatility mainly comes at the price of reduced run-time performance. According to our extensive regression test suite, Derivgrind produces correct results on GCC- and Clang-compiled programs, including a Python interpreter, with a small number of exceptions. We provide a list of "bit-tricks" that Derivgrind does not handle correctly, some of which actually appear in highly optimized math libraries. As long as differentiating those is avoided, Derivgrind enables black-box forward-mode AD for an unprecedentedly wide range of cross-language software with little integration effort.
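To sketch the arithmetic behind forward-mode AD: every value is paired with a dot value carrying its derivative, and each operation propagates both. Derivgrind keeps the dot values in shadow memory at the machine-code level; the hypothetical source-level example below does the same with an explicit struct.

```c
/* ad-demo.c -- build: cc ad-demo.c -o ad-demo -lm
 * Minimal forward-mode AD with dual numbers: each value carries a dot value
 * (its derivative with respect to the chosen input).  Derivgrind stores the
 * dot values in shadow memory instead of a struct, but the propagation rules
 * are the same. */
#include <math.h>
#include <stdio.h>

typedef struct { double val, dot; } dual;

static dual d_add(dual a, dual b) {          /* sum rule */
    return (dual){ a.val + b.val, a.dot + b.dot };
}
static dual d_mul(dual a, dual b) {          /* product rule */
    return (dual){ a.val * b.val, a.dot * b.val + a.val * b.dot };
}
static dual d_sin(dual a) {                  /* chain rule */
    return (dual){ sin(a.val), cos(a.val) * a.dot };
}

int main(void) {
    dual x = { 1.5, 1.0 };                   /* seed dx/dx = 1 */
    dual y = d_add(d_mul(x, x), d_sin(x));   /* y = x*x + sin(x) */
    printf("y = %f, dy/dx = %f (expect 2x + cos x = %f)\n",
           y.val, y.dot, 2 * 1.5 + cos(1.5));
    return 0;
}
```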
Preprint
The string graph for a collection of next-generation reads is a lossless data representation that is fundamental for de novo assemblers based on the overlap-layout-consensus paradigm. In this paper, we explore a novel approach to compute the string graph, based on the FM-index and Burrows-Wheeler Transform. We describe a simple algorithm that uses only the FM-index representation of the collection of reads to construct the string graph, without accessing the input reads. Our algorithm has been integrated into the SGA assembler as a standalone module to construct the string graph. The new integrated assembler has been assessed on a standard benchmark, showing that FSG is significantly faster than SGA while maintaining a moderate use of main memory, and showing practical advantages in running FSG on multiple threads.
Article
Due to the economy and low power consumption features, bare-metal IoT devices have been widely used in various areas of our life, and they are usually paired with companion mobile apps to configure them and view their states (a.k.a., appified IoT systems). The IoT systems have already become lucrative and profitable targets for attackers because compromised IoT devices pose severe threats to IoT security and reliability. This problem becomes worse on bare-metal IoT devices since the tradeoff among price, functionality, performance, and energy efficiency usually results in insufficient security protection. Such bare-metal IoT devices usually adopt OTA (Over-The-Air) methods to update firmware, which is managed by the companion apps running on smartphones. Despite the prevalence of these appified IoT systems, there is a lack of systematic research on the security of bare-metal IoT device firmware update (DFU), although recent studies have reported security flaws in such systems. In this paper, we propose a holistic approach to investigate DFU security of these appified IoT systems through collaboratively analyzing the bare-metal firmware and the companion app. Additionally, we have developed an IoT system analysis framework named BareDFU to automate the complex and time-consuming analysis tasks and facilitate the investigation. After applying BareDFU to analyze 1,637 companion IoT apps, we found 710 of them contained security flaws spanning all three DFU stages: authentication, firmware acquisition, and firmware verification. Furthermore, we leveraged BareDFU to investigate the bare-metal DFU security of six commercial appified IoT systems, and discovered they all had DFU flaws, which we successfully exploited to launch proof-of-concept firmware modification attacks. The affected vendors have acknowledged our findings and addressed the security flaws.
Conference Paper
Full-text available
Dynamic binary instrumentation (DBI) frameworks make it easy to build dynamic binary analysis (DBA) tools such as checkers and profilers. Much of the focus on DBI frameworks has been on performance; little attention has been paid to their capabilities. As a result, we believe the potential of DBI has not been fully exploited. In this paper we describe Valgrind, a DBI framework designed for building heavyweight DBA tools. We focus on its unique support for shadow values, a powerful but previously little-studied and difficult-to-implement DBA technique, which requires a tool to shadow every register and memory value with another value that describes it. This support accounts for several crucial design features that distinguish Valgrind from other DBI frameworks. Because of these features, lightweight tools built with Valgrind run comparatively slowly, but Valgrind can be used to build more interesting, heavyweight tools that are difficult or impossible to build with other DBI frameworks such as Pin and DynamoRIO.
Article
Full-text available
In this paper, we describe DIOTA, a novel method for instrumenting binaries. The technique correctly deals with programs that contain traditionally hard to instrument features such as data in code and code in data. The technique does not require reverse engineering, program understanding tools or heuristics about the compiler or linker used. The basic idea is that instrumented code is generated on the fly, while the original process is used for data accesses. DIOTA comes with a number of useful backends to check programs for faulty memory accesses, data races, deadlocks, ... and perform basic tracing operations, e.g. tracing all memory accesses, all code being executed, to perform coverage analysis, ... DIOTA has been implemented for the IA32 architecture running the Linux operating system.
Conference Paper
Full-text available
Modern architecture research relies heavily on application-level detailed pipeline simulation. A time consuming part of building a simulator is correctly emulating the operating system effects, which is required even if the goal is to simulate just the application code, in order to achieve functional correctness of the application's execution. Existing application-level simulators require manually hand coding the emulation of each and every possible system effect (e.g., system call, interrupt, DMA transfer) that can impact the application's execution. Developing such an emulator for a given operating system is a tedious exercise, and it can also be costly to maintain it to support newer versions of that operating system. Furthermore, porting the emulator to a completely different operating system might involve building it all together from scratch.In this paper, we describe a tool that can automatically log operating system effects to guide architecture simulation of application code. The benefits of our approach are: (a) we do not have to build or maintain any infrastructure for emulating the operating system effects, (b) we can support simulation of more complex applications on our application-level simulator, including those applications that use asynchronous interrupts, DMA transfers, etc., and (c) using the system effects logs collected by our tool, we can deterministically re-execute the application to guide architecture simulation that has reproducible results.
Conference Paper
Full-text available
Computer security is severely threatened by software vulnerabilities. Prior work shows that information flow tracking (also referred to as taint analysis) is a promising technique to detect a wide range of security attacks. However, current information flow tracking systems are not very practical, because they either require program annotations, source code, non-trivial hardware extensions, or incur prohibitive runtime overheads. This paper proposes a low overhead, software-only information flow tracking system, called LIFT, which minimizes run-time overhead by exploiting dynamic binary instrumentation and optimizations for detecting various types of security attacks without requiring any hardware changes. More specifically, LIFT aggressively eliminates unnecessary dynamic information flow tracking, coalesces information checks, and efficiently switches between target programs and instrumented information flow tracking code. We have implemented LIFT on a dynamic binary instrumentation framework on Windows. Our real-system experiments with two real-world server applications, one client application and eighteen attack benchmarks show that LIFT can effectively detect various types of security attacks. LIFT also incurs very low overhead, only 6.2% for server applications, and 3.6 times on average for seven SPEC INT2000 applications. Our dynamic optimizations are very effective in reducing the overhead by a factor of 5-12 times.
Conference Paper
Full-text available
Robust and powerful software instrumentation tools are essential for program analysis tasks such as profiling, performance evaluation, and bug detection. To meet this need, we have developed a new instrumentation system called Pin. Our goals are to provide easy-to-use, portable, transparent, and efficient instrumentation. Instrumentation tools (called Pintools) are written in C/C++ using Pin's rich API. Pin follows the model of ATOM, allowing the tool writer to analyze an application at the instruction level without the need for detailed knowledge of the underlying instruction set. The API is designed to be architecture independent whenever possible, making Pintools source compatible across different architectures. However, a Pintool can access architecture-specific details when necessary. Instrumentation with Pin is mostly transparent as the application and Pintool observe the application's original, uninstrumented behavior. Pin uses dynamic compilation to instrument executables while they are running. For efficiency, Pin uses several techniques, including inlining, register re-allocation, liveness analysis, and instruction scheduling to optimize instrumentation. This fully automated approach delivers significantly better instrumentation performance than similar tools. For example, Pin is 3.3x faster than Valgrind and 2x faster than DynamoRIO for basic-block counting. To illustrate Pin's versatility, we describe two Pintools in daily use to analyze production software. Pin is publicly available for Linux platforms on four architectures: IA32 (32-bit x86), EM64T (64-bit x86), Itanium®, and ARM. In the ten months since Pin 2 was released in July 2004, there have been over 3000 downloads from its website.
Article
Full-text available
Redux is a tool that generates dynamic dataflow graphs. It generates these graphs by tracing a program's execution and recording every value-producing operation that takes place, building up a complete computational history of every value produced. For that execution, by considering the parts of the graph reachable from system call inputs, we can choose to see only the dataflow that affects the outside world. Redux works with program binaries, and thus is not restricted to programs written in any particular language.We explain how Redux works, and show how dynamic dataflow graphs give the essence of a program's computation. We show how Redux can be used for debugging and program slicing, and consider a range of other possible uses.
Conference Paper
Full-text available
Dynamic optimization is emerging as a promising approach to overcome many of the obstacles of traditional static compilation. But while there are a number of compiler infrastructures for developing static optimizations, there are very few for developing dynamic optimizations. We present a framework for implementing dynamic analyses and optimizations. We provide an interface for building external modules, or clients, for the DynamoRIO dynamic code modification system. This interface abstracts away many low-level details of the DynamoRIO runtime system while exposing a simple and powerful, yet efficient and lightweight API. This is achieved by restricting optimization units to linear streams of code and using adaptive levels of detail for representing instructions. The interface is not restricted to optimization and can be used for instrumentation, profiling, dynamic translation, etc. To demonstrate the usefulness and effectiveness of our framework, we implemented several optimizations. These improve the performance of some applications by as much as 40% relative to native execution. The average speedup relative to base DynamoRIO performance is 12%.
Article
Full-text available
Dynamic optimization is emerging as a promising approach to overcome many of the obstacles of traditional static compilation. But while there are a number of compiler infrastructures for developing static optimizations, there are very few for developing dynamic optimizations. We present a framework for implementing dynamic analyses and optimizations. We provide an interface for building external modules, or clients, for the DynamoRIO dynamic code modification system. This interface abstracts away many low-level details of the DynamoRIO runtime system while exposing a simple and powerful, yet efficient and lightweight, API. This is achieved by restricting optimization units to linear streams of code and using adaptive levels of detail for representing instructions. The interface is not restricted to optimization and can be used for instrumentation, profiling, dynamic translation, etc.
Article
Full-text available
We describe the design and implementation of Dynamo, a software dynamic optimization system that is capable of transparently improving the performance of a native instruction stream as it executes on the processor. The input native instruction stream to Dynamo can be dynamically generated (by a JIT for example), or it can come from the execution of a statically compiled native binary. This paper evaluates the Dynamo system in the latter, more challenging situation, in order to emphasize the limits, rather than the potential, of the system. Our experiments demonstrate that even statically optimized native binaries can be accelerated by Dynamo, and often by a significant degree. For example, the average performance of -O optimized SpecInt95 benchmark binaries created by the HP product C compiler is improved to a level comparable to their -O4 optimized version running without Dynamo. Dynamo achieves this by focusing its efforts on optimization opportunities that tend to manifest only at runtime, and hence opportunities that might be difficult for a static compiler to exploit. Dynamo's operation is transparent in the sense that it does not depend on any user annotations or binary instrumentation, and does not require multiple runs, or any special compiler, operating system or hardware support. The Dynamo prototype presented here is a realistic implementation running on an HP PA-8000 workstation under the HPUX 10.20 operating system.
Conference Paper
Software vulnerabilities have had a devastating effect on the Internet. Worms such as CodeRed and Slammer can compromise hundreds of thousands of hosts within hours or even minutes, and cause millions of dollars of damage (25, 42). To successfully combat these fast automatic Internet attacks, we need fast automatic attack detection and filtering mechanisms. In this paper we propose dynamic taint analysis for automatic detection of overwrite attacks, which include most types of exploits. This approach does not need source code or special compilation for the monitored program, and hence works on commodity software. To demonstrate this idea, we have implemented TaintCheck, a mechanism that can perform dynamic taint analysis by performing binary rewriting at run time. We show that TaintCheck reliably detects most types of exploits. We found that TaintCheck produced no false positives for any of the many different programs that we tested. Further, we describe how TaintCheck could improve automatic signature generation in several ways.
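TaintCheck's own implementation is not shown here; the fragment below is only a generic illustration of the core idea of dynamic taint analysis: bytes from untrusted sources are marked tainted in a shadow map, taint is propagated through copies and arithmetic, and a check fires when tainted data reaches a sensitive sink such as a jump target. All names are ours.

```cpp
// Generic illustration of dynamic taint analysis (not TaintCheck's code):
// one shadow taint flag per memory byte, propagated through data movement
// and arithmetic, checked at control-transfer sinks.
#include <cstdint>
#include <cstdio>
#include <unordered_map>

static std::unordered_map<std::uintptr_t, bool> taint;   // shadow: addr -> tainted?

void taint_source(std::uintptr_t addr, std::size_t len)  // e.g. a recv() buffer
{
    for (std::size_t i = 0; i < len; i++) taint[addr + i] = true;
}

void propagate_copy(std::uintptr_t dst, std::uintptr_t src, std::size_t len)
{
    for (std::size_t i = 0; i < len; i++) taint[dst + i] = taint[src + i];
}

void propagate_binop(std::uintptr_t dst, std::uintptr_t a, std::uintptr_t b)
{
    taint[dst] = taint[a] || taint[b];   // result tainted if either input is
}

void check_sink(std::uintptr_t addr)     // e.g. a value about to be used as a jump target
{
    if (taint[addr])
        std::fprintf(stderr, "tainted data used at sink: %#lx\n", (unsigned long)addr);
}
```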
Conference Paper
Many important software systems are written in the C programming language. Unfortunately, the C language does not provide strong safety guarantees, and many common programming mistakes introduce type errors that are not caught by the compiler. These errors only manifest themselves at run time through unexpected program behavior, and it is often hard to isolate and identify their causes. This paper presents the Hobbes run-time type checker for compiled C programs. Our tool interprets compiled binaries, tracks type information for all memory and register locations, and reports warnings when a variety of type errors occur. Because the Hobbes type checker does not rely on source code, it is effective in many situations where similar tools are not, such as when full source code is not available or when C source is linked with program fragments written in assembly or other languages.
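The Hobbes abstract outlines the approach without code; as a purely illustrative sketch (our names, not Hobbes's), a run-time type checker of this flavour keeps a type tag per memory location and flags operations whose operands carry an incompatible tag.

```cpp
// Illustrative sketch of shadow type tags per memory location (not Hobbes's code).
#include <cstdint>
#include <cstdio>
#include <unordered_map>

enum class Tag { Unknown, Int, Pointer, Float };

static std::unordered_map<std::uintptr_t, Tag> type_tag;  // shadow: addr -> tag

// Called when a value with a known tag is stored to memory.
void record_store(std::uintptr_t addr, Tag t) { type_tag[addr] = t; }

// Called by the interpreter when the value stored at addr is dereferenced.
void check_deref(std::uintptr_t addr)
{
    Tag t = type_tag.count(addr) ? type_tag[addr] : Tag::Unknown;
    if (t != Tag::Pointer && t != Tag::Unknown)
        std::fprintf(stderr, "type error: dereferencing a non-pointer value\n");
}
```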
Article
Valgrind is a programmable framework for creating program supervision tools such as bug detectors and profilers. It executes supervised programs using dynamic binary translation, giving it total control over their every part without requiring source code, and without the need for recompilation or relinking prior to execution. New supervision tools can be easily created by writing skins that plug into Valgrind's core. As an example, we describe one skin that performs Purify-style memory checks for C and C++ programs.
Conference Paper
TaintTrace is a high performance flow tracing tool that protects systems against security exploits. It is based on dynamic execution binary rewriting, empowering our tool with fine-grained monitoring of system activities such as the tracking of the usage and propagation of data originated from the network. The challenge lies in minimizing the run-time overhead of the tool. TaintTrace uses a number of techniques such as direct memory mapping to optimize performance. In this paper, we demonstrate that TaintTrace is effective in protecting against various attacks while maintaining a modest slowdown of 5.5 times, offering significant improvements over similar tools.
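The "direct memory mapping" mentioned above is the standard trick of reserving a fixed-offset shadow region, so that a shadow address is computed with one addition instead of a table lookup. The sketch below illustrates the idea with a made-up offset, not TaintTrace's actual layout.

```cpp
// Illustration of direct-mapped shadow memory: shadow(addr) = addr + OFFSET.
// The offset below is hypothetical; real tools choose it to fit the platform's
// address-space layout and reserve the whole region up front (e.g. with mmap).
#include <cstdint>

static const std::uintptr_t kShadowOffset = 0x200000000000ull;  // hypothetical

static inline std::uint8_t *shadow_of(std::uintptr_t addr)
{
    return reinterpret_cast<std::uint8_t *>(addr + kShadowOffset);
}

// Marking and querying a byte's taint is then a single memory access each.
static inline void set_taint(std::uintptr_t addr, std::uint8_t t) { *shadow_of(addr) = t; }
static inline std::uint8_t get_taint(std::uintptr_t addr)         { return *shadow_of(addr); }
```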
Conference Paper
Several existing dynamic binary analysis tools use shadow memory—they shadow, in software, every byte of memory used by a program with another value that says something about it. Shadow memory is difficult to implement both efficiently and robustly. Nonetheless, existing shadow memory implementations have not been studied in detail. This is unfortunate, because shadow memory is powerful—for example, some of the existing tools that use it detect critical errors such as bad memory accesses, data races, and uses of uninitialised or untrusted data. In this paper we describe the implementation of shadow memory in Memcheck, a popular memory checker built with Valgrind, a dynamic binary instrumentation framework. This implementation has several novel features that make it efficient: carefully chosen data structures and operations result in a mean slow-down factor of only 22.2 and moderate memory usage. This may sound slow, but we show it is 8.9 times faster and 8.5 times smaller on average than a naive implementation, and shadow memory operations account for only about half of Memcheck's execution time. Equally importantly, unlike some tools, Memcheck's shadow memory implementation is robust: it is used on Linux by thousands of programmers on sizeable programs such as Mozilla and OpenOffice, and is suited to almost any memory configuration. This is the first detailed description of a robust shadow memory implementation, and the first detailed experimental evaluation of any shadow memory implementation. The ideas within are applicable to any shadow memory tool built with any instrumentation framework.
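Memcheck's actual data structures are described in the paper itself; the sketch below only illustrates the general two-level shape such a shadow memory commonly takes on a 32-bit address space, with a shared distinguished secondary map standing in for large regions whose shadow state is uniform. Names and sizes are ours.

```cpp
// Generic two-level shadow-memory sketch for a 32-bit address space
// (illustrative only; Memcheck's real implementation differs in detail).
#include <cstdint>
#include <cstring>

static const unsigned kSecondaryBits = 16;                  // 64 KB chunks
static const unsigned kSecondarySize = 1u << kSecondaryBits;

struct Secondary { std::uint8_t shadow[kSecondarySize]; };

// One distinguished secondary shared by every address range whose shadow
// state is uniform; a private copy is materialised before the first write.
static Secondary distinguished_uniform;                      // zero-initialised
static Secondary *primary[1u << (32 - kSecondaryBits)];      // 64K pointers

static std::uint8_t *shadow_byte_for_write(std::uint32_t addr)
{
    std::uint32_t hi = addr >> kSecondaryBits;
    std::uint32_t lo = addr & (kSecondarySize - 1);
    Secondary *sec = primary[hi];
    if (sec == nullptr || sec == &distinguished_uniform) {
        sec = new Secondary;                                  // materialise on demand
        std::memset(sec->shadow, 0, kSecondarySize);
        primary[hi] = sec;
    }
    return &sec->shadow[lo];
}
// A real implementation would also provide a read path that returns the
// distinguished secondary's bytes without materialising a private copy.
```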
Conference Paper
We present Memcheck, a tool that has been implemented with the dynamic binary instrumentation framework Valgrind. Memcheck detects a wide range of memory errors in programs as they run. This paper focuses on one kind of error that Memcheck detects: undefined value errors. Such errors are common, and often cause bugs that are hard to find in programs written in languages such as C, C++ and Fortran. Memcheck's definedness checking improves on that of previous tools by being accurate to the level of individual bits. This accuracy gives Memcheck a low false positive and false negative rate. The definedness checking involves shadowing every bit of data in registers and memory with a second bit that indicates if the bit has a defined value. Every value-creating operation is instrumented with a shadow operation that propagates shadow bits appropriately. Memcheck uses these shadow bits to detect uses of undefined values that could adversely affect a program's behaviour. Under Memcheck, programs typically run 20-30 times slower than normal. This is fast enough to use with large programs. Memcheck finds many errors in real programs, and has been used during the past two years by thousands of programmers on a wide range of systems, including OpenOffice, Mozilla, Opera, KDE, GNOME, MySQL, Perl, Samba, The GIMP, and Unreal Tournament.
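As a hedged example of what a "shadow operation" for bit-level definedness can look like, the fragment below shows one plausible propagation rule for bitwise AND: a result bit is defined if both operand bits are defined, or if either operand bit is a defined zero (since zero AND anything is zero). This matches the intuition described above; Memcheck's real rules cover many more operations and are described in the paper.

```cpp
// Sketch of a bit-precise shadow ("V-bit") propagation rule for bitwise AND.
// Convention: a shadow bit of 1 means "undefined".
#include <cstdint>

std::uint32_t vbits_and(std::uint32_t a, std::uint32_t va,
                        std::uint32_t b, std::uint32_t vb)
{
    // A result bit is defined if: both inputs are defined, or
    // a is a defined 0, or b is a defined 0.
    std::uint32_t defined = (~va & ~vb) | (~va & ~a) | (~vb & ~b);
    return ~defined;     // shadow (undefinedness) bits of the result (a & b)
}
```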
Conference Paper
An abstract type groups variables that are used for related purposes in a program. We describe a dynamic unification-based analysis for inferring abstract types. Initially, each run-time value gets a unique abstract type. A run-time interaction among values indicates that they have the same abstract type, so their abstract types are unified. Also at run time, abstract types for variables are accumulated from abstract types for values. The notion of interaction may be customized, permitting the analysis to compute finer or coarser abstract types; these different notions of abstract type are useful for different tasks. We have implemented the analysis for compiled x86 binaries and for Java bytecodes. Our experiments indicate that the inferred abstract types are useful for program comprehension, improve both the results and the run time of a follow-on program analysis, and are more precise than the output of a comparable static analysis, without suffering from overfitting.
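The unification step described above is naturally implemented with a union-find (disjoint-set) structure over value identifiers. The sketch below is a generic illustration of that step, not the paper's implementation; the names are ours.

```cpp
// Generic union-find over abstract-type representatives (illustrative only).
#include <cstddef>
#include <numeric>
#include <vector>

struct AbstractTypes {
    std::vector<std::size_t> parent;

    explicit AbstractTypes(std::size_t nValues) : parent(nValues) {
        std::iota(parent.begin(), parent.end(), 0);   // each value starts in its own type
    }
    std::size_t find(std::size_t x) {
        while (parent[x] != x) { parent[x] = parent[parent[x]]; x = parent[x]; }
        return x;
    }
    // Called when two run-time values "interact" (e.g. one is compared with,
    // assigned to, or combined arithmetically with the other).
    void interact(std::size_t a, std::size_t b) { parent[find(a)] = find(b); }
};
```

Customizing which operations count as interactions, as the abstract notes, directly controls how coarse the resulting equivalence classes become.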
Article
We present a new approach for tracking programs' use of data through arbitrary calculations, to determine how much information about secret inputs is revealed by public outputs. Using a fine-grained dynamic bit-tracking analysis, the technique measures the information revealed during a particular execution. The technique accounts for indirect flows, e.g. via branches and pointer operations. Two kinds of untrusted annotation improve the precision of the analysis. An implementation of the technique based on dynamic binary translation is demonstrated on real C, C++, and Objective C programs of up to half a million lines of code. In case studies, the tool checked multiple security policies, including one that was violated by a previously unknown bug.
Article
Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2004. Includes bibliographical references (p. 293-306). This thesis addresses the challenges of building a software system for general-purpose runtime code manipulation. Modern applications, with dynamically-loaded modules and dynamically-generated code, are assembled at runtime. While it was once feasible at compile time to observe and manipulate every instruction--which is critical for program analysis, instrumentation, trace gathering, optimization, and similar tools--it can now only be done at runtime. Existing runtime tools are successful at inserting instrumentation calls, but no general framework has been developed for fine-grained and comprehensive code observation and modification without high overheads. This thesis demonstrates the feasibility of building such a system in software. We present DynamoRIO, a fully-implemented runtime code manipulation system that supports code transformations on any part of a program, while it executes. DynamoRIO uses code caching technology to provide efficient, transparent, and comprehensive manipulation of an unmodified application running on a stock operating system and commodity hardware. DynamoRIO executes large, complex, modern applications with dynamically-loaded, generated, or even modified code. Despite the formidable obstacles inherent in the IA-32 architecture, DynamoRIO provides these capabilities efficiently, with zero to thirty percent time and memory overhead on both Windows and Linux. DynamoRIO exports an interface for building custom runtime code manipulation tools of all types. It has been used by many researchers, with several hundred downloads of our public release, and is being commercialized in a product for protection against remote security exploits, one of numerous applications of runtime code manipulation. by Derek L. Bruening. Ph.D.
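The code-caching technology the thesis refers to follows a pattern common to dynamic binary translators: copy a basic block into a code cache on first use, then execute cached copies thereafter. The loop below is a generic, simplified illustration of that pattern, not DynamoRIO's actual dispatcher; the translator and execution hooks are passed in as placeholders.

```cpp
// Generic sketch of a code-cache dispatch loop (not DynamoRIO's code).
#include <cstdint>
#include <unordered_map>

using AppPC   = std::uintptr_t;   // address in the original application
using CachePC = std::uintptr_t;   // address of the translated copy in the cache

void dispatch(AppPC start,
              CachePC (*translate_block)(AppPC),    // decode and copy one basic block
              AppPC   (*execute_cached)(CachePC))   // run a fragment, return next app PC
{
    std::unordered_map<AppPC, CachePC> code_cache;
    AppPC pc = start;
    for (;;) {
        auto it = code_cache.find(pc);
        CachePC frag = (it != code_cache.end())
                           ? it->second
                           : (code_cache[pc] = translate_block(pc));   // first use: translate
        pc = execute_cached(frag);     // control returns here at the fragment's end
    }
}
```

Real systems additionally link fragments directly to one another so that hot paths never re-enter this dispatch loop, which is where most of the low overhead comes from.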
Article
Thesis (Ph.D., Division of Engineering and Applied Sciences)--Harvard University, 2004. Includes bibliographical references (leaves 132-140).
Article
Shadow processing was motivated, in part, by a tool called AE that supports abstract execution [17]. AE is used for efficient generation of detailed program traces. A source program, in C, is instrumented to record a small set of key events during execution. After execution these events serve as input to an abstract version of the original program that can recreate a full trace of the original program. The events recorded by the original program include control flow decisions. These are essentially the same data needed by a shadow process to follow a main process. AE is a post-run technique that shifts some of the costs involved in tracing certain incidents during a program's execution to the program that uses those incidents. In contrast, shadow processing is a run-time technique that removes expensive tracing from the critical execution path of a program and shifts it to another processor.
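As a hedged, generic illustration of the idea (not AE's or the paper's code), the main thread can record its control-flow decisions into a shared queue while a shadow thread consumes them to follow the same path and perform checking off the critical path.

```cpp
// Illustrative sketch only: main thread records branch outcomes,
// shadow thread replays them on another processor.
#include <condition_variable>
#include <mutex>
#include <queue>

struct EventQueue {
    std::queue<bool> branches;          // one entry per conditional-branch outcome
    std::mutex m;
    std::condition_variable cv;

    void record(bool taken) {           // called by the main process/thread
        std::lock_guard<std::mutex> g(m);
        branches.push(taken);
        cv.notify_one();
    }
    bool replay() {                     // called by the shadow process/thread
        std::unique_lock<std::mutex> l(m);
        cv.wait(l, [this] { return !branches.empty(); });
        bool taken = branches.front();
        branches.pop();
        return taken;
    }
};
```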
Article
A linear-scan algorithm directs the global allocation of register candidates to registers based on a simple linear sweep over the program being compiled. This approach to register allocation makes sense for systems, such as those for dynamic compilation, where compilation speed is important. In contrast, most commercial and research optimizing compilers rely on a graph-coloring approach to global register allocation. In this paper, we compare the performance of a linear-scan method against a modern graph-coloring method. We implement both register allocators within the Machine SUIF extension of the Stanford SUIF compiler system. Experimental results show that linear scan is much faster than coloring on benchmarks with large numbers of register candidates. We also describe improvements to the linear-scan approach that do not change its linear character, but allow it to produce code of a quality near to that produced by graph coloring. Keywords: global register allocation, graph coloring ...
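The linear-scan idea is easy to convey in a few lines; the sketch below follows the textbook formulation (live intervals sorted by start point, an active set, spilling the interval that ends furthest away) and is not the Machine SUIF implementation used in the paper.

```cpp
// Textbook-style linear-scan register allocation sketch (illustrative only).
#include <algorithm>
#include <list>
#include <vector>

struct Interval { int start, end, vreg; int assigned = -1; bool spilled = false; };

void linear_scan(std::vector<Interval> &intervals, int numRegs)
{
    std::sort(intervals.begin(), intervals.end(),
              [](const Interval &a, const Interval &b) { return a.start < b.start; });

    std::list<Interval *> active;                 // intervals currently holding a register
    std::vector<int> freeRegs(numRegs);
    for (int r = 0; r < numRegs; r++) freeRegs[r] = r;

    for (Interval &cur : intervals) {
        // Expire intervals that ended before this one starts, freeing their registers.
        for (auto it = active.begin(); it != active.end();) {
            if ((*it)->end < cur.start) { freeRegs.push_back((*it)->assigned); it = active.erase(it); }
            else ++it;
        }
        if (!freeRegs.empty()) {
            cur.assigned = freeRegs.back(); freeRegs.pop_back();
            active.push_back(&cur);
        } else {
            // No register free: spill whichever of the candidates ends furthest away.
            auto furthest = std::max_element(active.begin(), active.end(),
                [](Interval *a, Interval *b) { return a->end < b->end; });
            if ((*furthest)->end > cur.end) {
                cur.assigned = (*furthest)->assigned;
                (*furthest)->spilled = true; (*furthest)->assigned = -1;
                active.erase(furthest); active.push_back(&cur);
            } else {
                cur.spilled = true;
            }
        }
    }
}
```

The single forward sweep, with no interference graph, is what makes the method attractive for dynamic compilation, as the abstract argues.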
Article
Software dynamic translation (SDT) systems observe and potentially modify programs while they are running. The overhead of monitoring and modifying a running program's instructions is often substantial in SDT. As a result SDT can be impractically slow, especially in SDT systems that do not or cannot employ dynamic optimization to offset overhead. This is unfortunate since SDT has obvious advantages in modern computing environments and interesting applications of SDT continue to emerge. In this paper we introduce two novel overhead reduction techniques that can improve SDT performance by a factor of three even when no dynamic optimization is performed. To demonstrate the effectiveness of our overhead reduction techniques, and to show the type of useful tasks to which low-overhead, non-optimizing SDT systems might be put, we implemented two dynamic safety checkers with SDT. These dynamic safety checkers perform useful tasks--preventing buffer-overrun exploits and restricting system call usage in untrusted binaries. Further, their performance is similar to, and in some cases better than, state-of-the-art tools that perform the same functions without SDT.