ABSTRACT: Troubleshooting the performance of production software is challenging. Most existing tools, such as profiling, tracing, and logging systems, reveal what events occurred during performance anomalies. However, users of such tools must infer why these events occurred; e.g., that their execution was due to a root cause such as a specific input request or configuration setting. Such inference often requires source code and detailed application knowledge that is beyond system administrators and end users. This paper introduces performance summarization, a technique for automatically diagnosing the root causes of performance problems. Performance summarization instruments binaries as applications execute. It first attributes performance costs to each basic block. It then uses dynamic information flow tracking to estimate the likelihood that a block was executed due to each potential root cause. Finally, it summarizes the overall cost of each potential root cause by summing the per-block cost multiplied by the cause-specific likelihood over all basic blocks. Performance summarization can also be performed differentially to explain performance differences between two similar activities. X-ray is a tool that implements performance summarization. Our results show that X-ray accurately diagnoses 17 performance issues in Apache, lighttpd, Postfix, and PostgreSQL, while adding 2.3% average runtime overhead.
Proceedings of the 10th USENIX conference on Operating Systems Design and Implementation; 10/2012
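The final summation step described in the abstract above can be illustrated with a minimal sketch. This is not X-ray's implementation: the function name `summarize`, the block identifiers, the cause labels, and all numbers are hypothetical, and the per-block costs and likelihoods are assumed to have been produced already by instrumentation and information flow tracking.

```python
from collections import defaultdict


def summarize(block_costs, block_likelihoods):
    """Sum, over all basic blocks, cost(block) * likelihood(block, cause).

    block_costs:       {block_id: cost}                  (hypothetical input)
    block_likelihoods: {block_id: {cause: likelihood}}   (hypothetical input)
    Returns {cause: total attributed cost}.
    """
    totals = defaultdict(float)
    for block, cost in block_costs.items():
        # A block with no recorded causes contributes nothing.
        for cause, likelihood in block_likelihoods.get(block, {}).items():
            totals[cause] += cost * likelihood
    return dict(totals)


# Made-up example data: two basic blocks and two candidate root causes.
block_costs = {"bb1": 10.0, "bb2": 4.0}
block_likelihoods = {
    "bb1": {"config:Timeout": 0.5, "request:GET /a": 0.25},
    "bb2": {"config:Timeout": 1.0},
}

summary = summarize(block_costs, block_likelihoods)
# Rank causes from most to least costly.
ranking = sorted(summary, key=summary.get, reverse=True)
print(ranking)
```

In this toy input, the configuration setting accumulates cost from both blocks, so it ranks above the request.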
ABSTRACT: Software misconfigurations are time-consuming and enormously frustrating to troubleshoot. In this paper, we show that dynamic information flow analysis helps solve these problems by pinpointing the root cause of configuration errors. We have built a tool called ConfAid that instruments application binaries to monitor the causal dependencies introduced through control and data flow as the program executes -- ConfAid uses these dependencies to link the erroneous behavior to specific tokens in configuration files. Our results using ConfAid to solve misconfigurations in OpenSSH, Apache, and Postfix show that ConfAid identifies the source of the misconfiguration as the first or second most likely root cause for 18 out of 18 real-world configuration errors and for 55 out of 60 randomly generated errors. ConfAid runs in only a few minutes, making it an attractive alternative to manual debugging.
9th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2010, October 4-6, 2010, Vancouver, BC, Canada, Proceedings; 01/2010
ABSTRACT: We present a novel method for diagnosing configuration management errors. Our proposed approach deduces the state of a buggy computer by running predicates that test system correctness and comparing the resulting execution to that generated by running the same predicates on a reference computer. Our approach generates signatures that represent the execution path of a predicate by recording the causal dependencies of its execution. Our results show that comparisons based on dependency sets significantly outperform comparisons based on predicate success or failure, uniquely identifying the correct bug 86-100% of the time. In the remaining cases, the dependency set method identifies the correct bug as one of two equally likely bugs.
2008 USENIX Annual Technical Conference, Boston, MA, USA, June 22-27, 2008. Proceedings; 01/2008
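The idea of comparing dependency-set signatures, as described in the abstract above, can be sketched as follows. The abstract does not state which comparison metric the authors use; Jaccard similarity is swapped in here purely as an assumption, and the function names, bug names, and file paths are all hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets (1.0 when both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def rank_bugs(observed, known_signatures):
    """Rank known bug signatures by similarity to the observed dependency set.

    observed:         set of dependencies recorded while running a predicate
    known_signatures: {bug_name: set of dependencies} (hypothetical data)
    Returns a list of (similarity, bug_name), best match first.
    """
    scores = [(jaccard(observed, sig), name)
              for name, sig in known_signatures.items()]
    # Highest similarity first; ties broken alphabetically by bug name.
    return sorted(scores, key=lambda s: (-s[0], s[1]))


# Made-up example: dependencies are represented as file names.
observed = {"/etc/httpd.conf", "/var/www/.htaccess", "libssl.so"}
known_signatures = {
    "bad_htaccess": {"/etc/httpd.conf", "/var/www/.htaccess"},
    "missing_module": {"/etc/httpd.conf", "mod_php.so"},
}

ranked = rank_bugs(observed, known_signatures)
print(ranked[0][1])
```

A comparison based only on predicate success or failure would collapse each signature to a single bit, which is why a set-based comparison can separate bugs that fail the same predicate.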
ABSTRACT: AutoBash is a set of interactive tools that helps users and system administrators manage configurations. AutoBash leverages causal tracking support implemented within our modified Linux kernel to understand the inputs (causal dependencies) and outputs (causal effects) of configuration actions. It uses OS-level speculative execution to try possible actions, examine their effects, and roll them back when necessary. AutoBash automates many of the tedious parts of trying to fix a misconfiguration, including searching through possible solutions, testing whether a particular solution fixes a problem, and undoing changes to persistent and transient state when a solution fails. Our results show that AutoBash correctly identifies the solution to several CVS, gcc cross-compiler, and Apache configuration errors. We also show that causal analysis reduces AutoBash's search time by an average of 35% and solution verification time by an average of 70%.
Proceedings of the 21st ACM Symposium on Operating Systems Principles 2007, SOSP 2007, Stevenson, Washington, USA, October 14-17, 2007; 01/2007
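The try, test, and roll back loop described in the abstract above can be sketched in a few lines. This is only an illustration under strong assumptions: system state is modeled as a plain dictionary, a deep copy stands in for OS-level speculative execution, and the names `try_solutions`, `set_port_80`, and the configuration keys are hypothetical rather than part of AutoBash.

```python
import copy


def try_solutions(state, solutions, predicate):
    """Try candidate fixes one at a time on a speculative copy of the state.

    state:      dict modeling system configuration (hypothetical stand-in)
    solutions:  list of (name, action) where action mutates a state dict
    predicate:  function(state) -> bool, True when the problem is fixed
    Returns (name, new_state) for the first fix that works,
    or (None, state) if no candidate passes the predicate.
    """
    for name, action in solutions:
        trial = copy.deepcopy(state)  # speculative copy; original untouched
        action(trial)
        if predicate(trial):
            return name, trial        # keep the changes made by this fix
        # Otherwise the trial copy is discarded, which acts as the rollback.
    return None, state


# Made-up example: a web server listening on the wrong port.
state = {"httpd.conf": {"Listen": 8080}}
solutions = [
    ("restart_server", lambda s: None),  # changes no configuration state
    ("set_port_80", lambda s: s["httpd.conf"].update(Listen=80)),
]
predicate = lambda s: s["httpd.conf"]["Listen"] == 80

fix, new_state = try_solutions(state, solutions, predicate)
print(fix)
```

Because every candidate runs against a copy, a failed attempt leaves the original state unchanged, mirroring the undo behavior the abstract attributes to speculative execution.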
ABSTRACT: Extreme transistor scaling trends in silicon technology are soon to reach a point where manufactured systems will suffer from limited device reliability and severely reduced lifetime, due to early transistor failures, gate oxide wear-out, manufacturing defects, and radiation-induced soft errors (SER). In this paper we present a low-cost technique to harden a microprocessor pipeline and caches against these reliability threats. Our approach utilizes online built-in self-test (BIST) and microarchitectural checkpointing to detect, diagnose, and recover the computation impaired by silicon defects or SER events. The approach works by periodically testing the processor to determine if the system is broken. If so, we reconfigure the processor to avoid using the broken component. A similar mechanism is used to detect SER faults, with the difference that recovery is implemented by re-execution. By utilizing low-cost techniques to address defects and SER, we keep protection costs significantly lower than traditional fault-tolerance approaches while providing high levels of coverage for a wide range of faults. Using detailed gate-level simulation, we find that our approach provides 95% and 99% coverage for silicon defects and SER events, respectively, with only a 14% area overhead.
2007 Design, Automation and Test in Europe Conference and Exposition (DATE 2007), April 16-20, 2007, Nice, France; 01/2007