Conference Paper · PDF available

Cognitive demands and activities in dynamic fault management: abductive reasoning and disturbance management

... The brief episode illustrates some of the forms of cognitive work in anomaly response (Woods, 1994; Woods and Hollnagel, 2006) and how these are carried out in the domain of web engineering and operations (Allspaw, 2015; Allspaw and Cook, 2018). ...
... Anomaly response has been studied in a variety of domains where dynamic fault management is a core activity. Examples include nuclear power plants, commercial aviation, space flight systems, and surgical operating theaters (Woods et al., 1994; Abbott, 1990; Patterson et al., 1999; Cook et al., 1993). Practitioners in these areas exhibit diagnostic behaviors alongside therapeutic interventions, relating to the abductive reasoning model (Woods, 1994). As anomalies cascade across a system, responders must keep pace, acting on their best explanation of the phenomenon given the knowledge available from their limited perspective, while ensuring the system continues to function. ...
Thesis
Full-text available
This thesis captures patterns and challenges in anomaly response in the domain of web engineering and operations by analyzing a corpus of four actual cases. Web production software systems operate at an unprecedented scale today, requiring extensive automation to develop and maintain services. The systems are designed to regularly adapt to dynamic load to avoid the consequences of overloading portions of the network. As the software systems scale and complexity grows, it becomes more difficult to observe, model, and track how the systems function and malfunction. Anomalies inevitably arise, challenging groups of responsible engineers to recognize and understand anomalous behaviors as they plan and execute interventions to mitigate or resolve the threat of service outages. The thesis applies process tracing techniques to a corpus of four cases and extends them to capture the interplay between the human and machine agents when anomalies arise in web operations. The cases were elicited from expert practitioners dealing with anomalies that risked saturating the capacity of the system to handle continuing load across multiple elements in the interconnected network. The analysis is based on a framework distinguishing parts of the system above the line of representation, the human engineers and operators, from the automated processes and components below the computer interfaces – the Above the Line / Below the Line framework (ABL). The analysis of the incidents directly links the cascade of disturbances below the line with the cognitive work of anomaly response above the line. Recorded digital text-based communications were artifacts used to construct several new representations of the cases from two perspectives: 1) tracing the evolving hypotheses from anomalous signs and interventions, and 2) charting the four basic coping strategies used in response to mitigating overload. 
The hypothesis generation timelines for the cases supported findings about the importance of updating mental models during incident response and showed that this activity happened explicitly and frequently among the distributed engineers. Diverse perspectives expanded the hypotheses considered and beneficially broadened the scope of investigation. Strange loops and weak representations focusing on primitive event changes hindered observability for the responders. Effects at a distance from their driving source also complicated hypothesis exploration, a natural consequence of the network complexity. Furthermore, the response timelines demonstrated the tendency of the automation to respond with tactical, local strategies, whereas the humans used a variety of mostly strategic responses to fill in the gaps left by the automation. New forms of tooling and monitoring could be designed to support diagnostic search across functional relationships, as well as broaden awareness of the hypothesis exploration space. The Above the Line / Below the Line (ABL) framework provided a beneficial frame of reference for analyzing the relevant parts of the system in each incident and could be the basis for future work in decision support tool design. The case study research demonstrated specific and general patterns for how the autonomy in complex web operation systems complicates the cognitive work of anomaly response.
... With the advent of the paperless cockpit philosophy and affordable technologies in aviation, Electronic Flight Bags (EFBs), electronic checklists and digital Quick Reference Handbooks (QRHs) are now part of every commercial aircraft cockpit. In the case of a system failure, the aircraft warning system draws the pilot's attention to a critical or urgent problem (Woods, 1994; Woods & Sarter, 2010). The fault message presented on the display prompts pilots to retrieve and action a digital non-normal checklist. ...
... The assumptions underlying QRH design do not always hold true, though. As aircraft become more robots than machines (Carim Jr., 2016), the possibility of complex anomalous behaviour increases exponentially (Woods, 1994), leaving pilots to cope with an increasing number of ill-structured technical faults. These are defined as problems that go beyond the QRH and warning system scope. ...
... The revisited anomaly management model proposed by Carim et al. (2016, 2020) is a further elaboration of the original proposition by Woods (1994). According to the revisited model, a fault, whether well- or ill-structured, presents in terms of disturbances because of the lack of a linear relationship between the fault cause and symptom (Woods, 1995; Watts-Perotti & Woods, 2007). ...
... The dynamic fault management model proposed by Woods (1994) and Woods and Hollnagel (2006) supports the description of the inherent complexity that operators face when dealing with abnormal and emergency situations in complex sociotechnical systems, such as cockpits, space missions, nuclear plants, and anaesthetic management during surgery (Watts et al. 1996; Watts-Perotti and Woods 2007). The model, also known as the anomaly management model, explains how operators make sense of the situation, diagnose, and act under ambiguity and uncertainty while managing the ongoing process. ...
... Indeed, disturbances impose additional cognitive demands on operators, who need to cope with them while maintaining the integrity of the monitored process (Christoffersen and Woods 2006). Therefore, operators manage the situation through three iterative and concurrent event-driven cognitive processes influenced by abductive reasoning: anomaly recognition and assessment, diagnosis, and responses or courses of action (Woods 1994; Watts et al. 1996). Based on these premises, the model criticizes modern alarm systems for their failure both to precisely diagnose the problem and to distinguish the important signals against a noisy background (Woods 1995). ...
... Only a few studies, such as Singer and Dekker (2000) and Carim Jr. et al. (2016), have applied the original model to practical situations in aviation. Despite their contributions, none of them has discussed its theoretical suitability to aviation, even though Woods (1994) argued that the model is generic and is partially based on studies of commercial aircraft (e.g., Abbott 1991). Moreover, a review of the model by Woods and Hollnagel (2006) did not add new evidence or concepts, thus highlighting the need for an update. ...
Article
Full-text available
More than 20 years ago, Woods proposed a model that accounts for the inherent complexity faced by operators when managing abnormal and emergency situations in highly complex sociotechnical systems. The model was reviewed a decade later, and only a few studies have applied it to aviation. This paper proposes adjustments to the original model, based on recent theoretical developments and empirical evidence on the anomaly management activity in aviation. The model was divided into five components; three of which (activity, types of reasoning involved, and resources) were revisited and further developed. The two other components (fault behaviour and unit of analysis) were not updated and only discussed in the aviation context. As a result, the revisited model descriptively clarifies how the activity of anomaly management emerges from the use of a wide repertoire of strategies, involving a spectrum of types of reasoning and a set of resources for action, which are not limited to those anticipated by designers, such as checklists and the warning system. An instantiation of the revisited model highlights the implications of false alarms, which trigger a cascade of disturbances that, in turn, requires adaptive strategies based on heuristics and analogies and supported by pilots' experience. The revisited model can support a more accurate analysis of anomalous situations and the redesign of work systems to achieve better performance.
... Anomaly response situations can become quite challenging as cascading effects and time pressures constrain the tasks of monitoring, diagnosis, and replanning. These activities are further complicated because operators must handle multiple interleaved tasks, consider multiple interacting goals, and be ready to revise assessments and plans as new evidence comes in or as the situation changes (Watts-Perotti and Woods, 2009; Woods, 1994). Schematic of anomaly-driven cognitive activities involved in disturbance management. ...
... Adapted from Woods, D.D., 1994. Cognitive demands and activities in dynamic fault management: abductive reasoning and disturbance management. ...
... In effect, these 'intelligent' machines create joint cognitive systems that distribute cognitive work across multiple agents (Woods, 1986;Roth, Bennett and Woods, 1987;Hutchins, 1990;Billings, 1991). It seems paradoxical but studies of the impact of automation reveal that design of automated systems is really the design of a new human-machine cooperative system (contrast many of the discussions about machine abductive reasoning with the observations about human and cooperative abduction in Woods, 1994). ...
... For example, consider the diagnostic situation in a multi-agent environment, when one notices an anomaly in the process they monitor (Woods, 1994). Is the anomaly an indication of an underlying fault, or does the anomaly indicate some activity by another agent in the system, unexpected by this monitor? ...
Chapter
Full-text available
Keynote at Cognitive Science Society Conference
... These skills must be complemented by cognitive knowledge-based competencies, what Caird described as building the functional fidelity, which are necessary for improvisational and adaptive decision making when events move beyond prescribed doctrine. Training scenarios must be designed that are complex enough to challenge these competencies at both task and team levels as situations move from textbook into exceptional cases (Woods & Hollnagel, 2006; Woods, 1994). Illustrative of the escalation principle (Woods & Hollnagel, 2006), as problems cascade, uncertainties place more demand on the cognitive and coordinative work of the agents in the system to respond. ...
Thesis
In many mission critical work domains, effective scenario-based training and observation are crucial to the success of complex socio-technical systems. Organizations employ many different approaches to conducting these kinds of training sessions, but as recent high-surprise disasters such as the Columbia loss and the 9/11 terrorist attacks indicate, there will always be new surprises that put that learning efficacy to the test. Anomalies will occur, new failure conditions will challenge existing systems and organizations, and these challenges of adapting to surprise and to resilience are nowhere more evident than in these mission critical domains. However, a significant amount of the training conducted in these types of organizations is not about training to be surprised; rather, it is about showing individual competency and that current training is effective. These large-scale socio-technical organizations could be more resilient if they effectively exploited the opportunities from these exercises to capture learning and facilitate a deeper understanding of the cognitive work in the domain.
This thesis proposes the learning laboratory as a support framework for such exercise design. The learning lab framework serves as a general abstraction of CSE staged world study design and envisioning techniques and extends these approaches to cope with new scalability challenges to resilience. CSE has a long history of conducting research in complex domains, utilizing effective staged and scaled world design techniques to support and illustrate the critical cognitive challenges of practitioners at work. The learning lab incorporates a variety of these techniques into a common framework that can be applied to a variety of different types of exercises already being conducted in order to maximize organizational understanding and learning from scenario-based observation exercises.
... The practitioners were quick to correct the physiologic, systemic threat even though they were unable to diagnose its source. This shift from diagnosis to what Woods (1988, 1994) calls disturbance management is crucial in the operating room and in other domains to maintaining the system in a stable configuration to permit later diagnosis and correction of the underlying faults. ...
Chapter
Full-text available
We begin with an introduction to the complexity of error through several exemplar incidents taken from anesthesiology. Each of these incidents may be considered by some to contain one or more human errors. Careful examination of the incidents, however, reveals a more complicated story about human performance. The incidents provide a way to introduce some of the research results about the factors that affect human performance in complex settings such as medicine. Because the incidents are drawn from anesthesiology, most of the discussion is about human performance in the conduct of anesthesia, but the conclusions apply to other medical specialties and even to other domains. The second part of the chapter deals more generally with the failures of large, complex systems and the sorts of problems those who would analyze human performance in such systems must encounter. It is significant that the results from studies in medicine and other domains such as aviation and nuclear power plant operation are parallel and strongly reinforcing. The processes of cognition are not fundamentally different between practitioners in these domains, and the problems that practitioners are forced to deal with are quite similar. We should not be surprised that the underlying features of breakdowns in these large, complex systems are [also] quite similar.
... Time constraints and high risk may also add up, thus increasing the difficulty of fault diagnosis. A particularly demanding situation is dynamic fault management (Woods 1994), whereby operators have to maintain system functions despite technical failures or disturbances. Typical fields of practice where dynamic fault management occurs are flight deck operations, control of space systems, anesthetic management, and process control. ...
Preprint
Full-text available
A set of 5 short articles on human performance and business-critical software infrastructure, including: 1. It’s time to revise our appreciation of the human side of Internet-facing software systems. 2. Above the Line, Below the Line. 3. Cognitive Work of Hypothesis Exploration during Anomaly Response. 4. Managing the Hidden Costs of Coordination. 5. Beyond the 'Fix-It' Treadmill.