Preprint (PDF available)

Abstract and Figures

A set of five short articles on human performance and business-critical software infrastructure:
1. It’s time to revise our appreciation of the human side of Internet-facing software systems
2. Above the Line, Below the Line
3. Cognitive Work of Hypothesis Exploration during Anomaly Response
4. Managing the Hidden Costs of Coordination
5. Beyond the 'Fix-It' Treadmill
References
Full-text available
Thesis
This thesis captures patterns and challenges in anomaly response in the domain of web engineering and operations by analyzing a corpus of four actual cases. Web production software systems operate at an unprecedented scale today, requiring extensive automation to develop and maintain services. The systems are designed to regularly adapt to dynamic load to avoid the consequences of overloading portions of the network. As the software systems' scale and complexity grow, it becomes more difficult to observe, model, and track how the systems function and malfunction. Anomalies inevitably arise, challenging groups of responsible engineers to recognize and understand anomalous behaviors as they plan and execute interventions to mitigate or resolve the threat of service outages.

The thesis applies process tracing techniques to the corpus of four cases and extends them to capture the interplay between the human and machine agents when anomalies arise in web operations. The cases were elicited from expert practitioners dealing with anomalies that risked saturating the system's capacity to handle continuing load across multiple elements of the interconnected network. The analysis is based on a framework that distinguishes the parts of the system above the line of representation, the human engineers and operators, from the automated processes and components below the computer interfaces: the Above the Line / Below the Line (ABL) framework. The analysis of the incidents directly links the cascade of disturbances below the line with the cognitive work of anomaly response above the line. Recorded digital text-based communications were the artifacts used to construct several new representations of the cases from two perspectives: 1) tracing the evolving hypotheses from anomalous signs and interventions, and 2) charting the four basic coping strategies used to mitigate overload.

The hypothesis generation timelines for the cases supported findings about the importance of updating mental models during incident response, and showed that this activity happened explicitly and frequently among the distributed engineers. Diverse perspectives expanded the hypotheses considered and beneficially broadened the scope of investigation. Strange loops and weak representations focused on primitive event changes hindered observability for the responders. Effects at a distance from their driving source, a natural consequence of the network complexity, also complicated hypothesis exploration. Furthermore, the response timelines demonstrated the tendency of the automation to respond with tactical, local strategies, whereas the humans used a variety of mostly strategic responses to fill in the gaps left by the automation. New forms of tooling and monitoring could be designed to support diagnostic search across functional relationships and to broaden awareness of the hypothesis exploration space. The ABL framework provided a beneficial frame of reference for analyzing the relevant parts of the system in each incident and could be the basis for future work in decision support tool design. The case study research demonstrated specific and general patterns by which the automation complicates the cognitive work of anomaly response in complex web operations systems.
Full-text available
Chapter
Shows how one can go beyond spartan laboratory paradigms and study complex problem-solving behaviors without abandoning all methodological rigor, and describes how to carry out process tracing or protocol analysis methods as a "field experiment." (PsycINFO Database Record (c) 2012 APA, all rights reserved)
Full-text available
Thesis
The increasing complexity of software applications and architectures in Internet services challenges the reasoning of operators tasked with diagnosing and resolving outages and degradations as they arise. Although a growing body of literature focuses on how failures can be prevented through more robust and fault-tolerant design of these systems, little research explores the cognitive challenges engineers face when those preventative designs fail and they are left to think and react to scenarios that had not been imagined. This study explores what heuristics, or rules of thumb, engineers employ when faced with an outage or degradation scenario in a business-critical Internet service. A case study approach was used, focusing on an actual outage of functionality during a period of high buying activity on a popular online marketplace. Heuristics and other tacit knowledge were identified, and they provide a promising avenue for both training and future interface design opportunities. Three diagnostic heuristics were identified as being in use: a) initially look for correlation between the behaviour and any recent changes made to the software; b) upon finding no correlation with a software change, widen the search to any potential contributors imagined; and c) when choosing a diagnostic direction, reduce the set by focusing on the one that most easily comes to mind, either because its symptoms match those of a difficult-to-diagnose event in the past or those of recent events. A fourth heuristic is coordinative in nature: when making changes to software in an effort to mitigate the untoward effects or to resolve the issue completely, rely on peer review of the changes more than on automated testing (if at all).
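The first of these heuristics, correlating the anomalous behaviour with recent software changes, maps naturally onto responder tooling. The sketch below is a minimal illustration, not drawn from the thesis itself: the ChangeEvent record, the two-hour lookback window, and the most-recent-first ranking are all assumptions made for the example.

    from dataclasses import dataclass
    from datetime import datetime, timedelta

    @dataclass
    class ChangeEvent:
        """A deploy, config push, or similar change (hypothetical record format)."""
        description: str
        applied_at: datetime

    def rank_recent_changes(anomaly_start: datetime,
                            changes: list[ChangeEvent],
                            window: timedelta = timedelta(hours=2)) -> list[ChangeEvent]:
        """Heuristic (a): surface the changes closest in time to the anomaly onset.

        Changes applied after the anomaly began, or before the lookback window,
        are dropped; the rest are ordered most-recent-first so the likeliest
        correlate is presented to the responder first.
        """
        candidates = [c for c in changes
                      if anomaly_start - window <= c.applied_at <= anomaly_start]
        return sorted(candidates, key=lambda c: c.applied_at, reverse=True)

    # Example: an anomaly starting at 15:00 points first at the 14:55 config push.
    changes = [
        ChangeEvent("checkout-service deploy", datetime(2024, 5, 1, 13, 40)),
        ChangeEvent("cache TTL config push", datetime(2024, 5, 1, 14, 55)),
    ]
    for c in rank_recent_changes(datetime(2024, 5, 1, 15, 0), changes):
        print(c.applied_at, c.description)

Widening the window parameter corresponds loosely to heuristic (b): when no correlated change turns up, broaden the search.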
Full-text available
Chapter
Generalizing the concepts of joint activity developed by Clark (1996), we describe key aspects of team coordination. Joint activity depends on interpredictability of the participants' attitudes and actions. Such interpredictability is based on common ground: pertinent knowledge, beliefs, and assumptions that are shared among the involved parties. Joint activity assumes a Basic Compact, which is an agreement (often tacit) to facilitate coordination and prevent its breakdown. One aspect of the Basic Compact is the commitment to some degree of aligning multiple goals. A second aspect is that all parties are expected to bear their portion of the responsibility to establish and sustain common ground and to repair it as needed. We apply our understanding of these features of joint activity to account for issues in the design of automation. Research in software and robotic agents seeks to understand and satisfy requirements for the basic aspects of joint activity. Given the widespread demand for increasing the effectiveness of team play for complex systems that work closely and collaboratively with people, observed shortfalls in these current research efforts are ripe for further exploration and study.
Chapter
The modern "system" is a constantly changing melange of hardware and software embedded in a variable world. Together, the hyperdistribution, fluctuant composition, constantly varying workload, and continuous modification of modern technology assemblies constitute a unique challenge for those who design, maintain, diagnose, and repair them. We are involved in exploring this challenge and trying to understand how people are able to keep our systems working and, in particular, how they make sense of what is happening around them. What we find is both inspiring and worrisome: inspiring because the studies reveal highly refined expertise in people and groups, along with novel mechanisms for bringing that expertise to bear; worrisome because the technology and organization are so often poorly configured to make this expertise effective.
Conference Paper
Cloud scale provides the vast resources necessary to replace failed components, but this is useful only if those failures can be detected. For this reason, the major availability breakdowns and performance anomalies we see in cloud environments tend to be caused by subtle underlying faults, i.e., gray failure rather than fail-stop failure. In this paper, we discuss our experiences with gray failure in production cloud-scale systems to show its broad scope and consequences. We also argue that a key feature of gray failure is differential observability: that the system's failure detectors may not notice problems even when applications are afflicted by them. This realization leads us to believe that, to best deal with them, we should focus on bridging the gap between different components' perceptions of what constitutes failure.
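Differential observability lends itself to a toy simulation; the liveness_probe and application_request functions below are invented for the example, not taken from the paper. The point is only that a shallow, infrastructure-level detector and the applications can reach opposite verdicts about the same component.

    import random

    def liveness_probe(host: str) -> bool:
        # The platform's failure detector: a shallow health check.
        # A gray-failing node typically still answers it. (Simulated.)
        return True

    def application_request(host: str) -> bool:
        # Real requests exercise deeper code paths and intermittently fail
        # on the gray-failing node. (Simulated: roughly 30% failure rate.)
        return random.random() >= 0.3

    def compare_perceptions(host: str, n: int = 1000) -> None:
        # Contrast the two observers' views of the same component.
        detector_says_healthy = liveness_probe(host)
        app_error_rate = sum(not application_request(host) for _ in range(n)) / n
        print(f"{host}: detector healthy={detector_says_healthy}, "
              f"application error rate={app_error_rate:.1%}")

    # The detector reports the node healthy while applications see ~30% errors,
    # so the platform never replaces the node despite its afflicted users.
    compare_perceptions("node-42")

Bridging the gap the authors describe would mean moving the failure detector's probes closer to what applications actually experience, for example by sampling real request outcomes rather than relying on shallow liveness checks.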
Book
The ethnographic study performed by Bruno Latour engaged him in the world of the scientific laboratory to develop an understanding of scientific culture through observation of scientists' daily interactions and processes. Latour assumed a scientific perspective in his study, observing his participants with the "same cold, unblinking eye" that they use in their daily research activities. He familiarized himself with the laboratory through an intense focus on "literary inscription," noting that the writing process drives every activity in the laboratory. He unpacked the structure of scientific literature to uncover its importance to scientists (factual knowledge), how scientists communicate, and the processes involved in generating scientific knowledge (use of assays, instrumentation, documentation). The introduction by Jonas Salk stated that Latour's study could increase public understanding of scientists, thereby decreasing both the expectations placed on them and the general fear toward them.