Large, production-quality distributed systems still fail periodically, and sometimes do so catastrophically, where most or all users experience an outage or data loss. We present the results of a comprehensive study investigating 198 randomly selected, user-reported failures that occurred on Cassandra, HBase, Hadoop Distributed File System (HDFS), Hadoop MapReduce, and Redis, with the goal of understanding how one or multiple faults eventually evolve into a user-visible failure. We found that, from a testing point of view, almost all failures require only 3 or fewer nodes to reproduce, which is good news considering that these services typically run on a very large number of nodes. However, multiple inputs are needed to trigger the failures, and the order in which they occur is important. Finally, we found that the error logs of these systems typically contain sufficient data on both the errors and the input events that triggered the failure, enabling the diagnosis and reproduction of the production failures – often with unit tests.

We found that the majority of catastrophic failures could easily have been prevented by performing simple testing on error handling code – the last line of defense – even without an understanding of the software design. We extracted three simple rules from the bugs that have led to some of the catastrophic failures, and developed a static checker, Aspirator, capable of locating these bugs. Over 30% of the catastrophic failures would have been prevented had Aspirator been used and the identified bugs fixed. Running Aspirator on the code of 9 distributed systems located 143 bugs and bad practices that have been fixed or confirmed by the developers.
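To illustrate the kind of error-handling defect described above, the following Java sketch shows one plausible pattern: an exception handler that swallows a failure, which is exactly the sort of trivial bug a simple rule-based checker or a basic unit test can expose. The class and method names (FlushJournal, flushQuietly, Storage) are hypothetical and are not taken from the studied systems or from Aspirator itself.

    import java.io.IOException;

    // Hypothetical illustration of a trivially buggy error handler:
    // the catch block swallows the exception, so callers never learn
    // that the flush failed. This is the kind of pattern a simple
    // rule-based checker or a basic unit test can catch.
    public class FlushJournal {

        interface Storage {
            void flush() throws IOException;
        }

        private final Storage storage;

        public FlushJournal(Storage storage) {
            this.storage = storage;
        }

        // BAD: the error is dropped and the caller believes the flush succeeded.
        public boolean flushQuietly() {
            try {
                storage.flush();
                return true;
            } catch (IOException e) {
                // TODO: handle this properly  <-- empty / placeholder handler
                return true;
            }
        }

        // GOOD: propagate the failure so it stays visible to the caller.
        public void flushOrFail() throws IOException {
            storage.flush();
        }

        // A minimal unit-test-style check: inject a failing Storage and
        // observe that flushQuietly() hides the error.
        public static void main(String[] args) {
            Storage failing = () -> { throw new IOException("disk full"); };
            FlushJournal journal = new FlushJournal(failing);
            boolean ok = journal.flushQuietly();
            System.out.println("flushQuietly reported success = " + ok
                    + " even though the flush threw; this is the bug.");
        }
    }

The point of the sketch is that exercising the handler even once, with a fault injected into the dependency, is enough to reveal the defect; no knowledge of the system's overall design is required.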