How Complex Systems Fail
(Being a Short Treatise on the Nature of Failure; How Failure is Evaluated; How Failure is
Attributed to Proximate Cause; and the Resulting New Understanding of Patient Safety)
Richard I. Cook, MD¹
Cognitive Technologies Laboratory
University of Chicago

¹ Current affiliation: The Ohio State University, Columbus, OH, USA.
1) Complex systems are intrinsically hazardous systems.
All of the interesting systems (e.g. transportation, healthcare, power generation) are inherently and
unavoidably hazardous by their own nature. The frequency of hazard exposure can sometimes be
changed but the processes involved in the system are themselves intrinsically and irreducibly
hazardous. It is the presence of these hazards that drives the creation of defenses against hazard
that characterize these systems.
2) Complex systems are heavily and successfully defended against failure.
The high consequences of failure lead over time to the construction of multiple layers of defense
against failure. These defenses include obvious technical components (e.g. backup systems,
‘safety’ features of equipment) and human components (e.g. training, knowledge) but also a
variety of organizational, institutional, and regulatory defenses (e.g. policies and procedures,
certification, work rules, team training). The effect of these measures is to provide a series of
shields that normally divert operations away from accidents.
3) Catastrophe requires multiple failures – single point failures are not enough.
The array of defenses works. System operations are generally successful. Overt catastrophic
failure occurs when small, apparently innocuous failures join to create opportunity for a systemic
accident. Each of these small failures is necessary to cause catastrophe but only the combination is
sufficient to permit failure. Put another way, there are many more failure opportunities than overt
system accidents. Most initial failure trajectories are blocked by designed system safety
components. Trajectories that reach the operational level are mostly blocked, usually by
practitioners.
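The arithmetic behind this point can be made concrete with a small illustration. The Monte Carlo sketch below is not part of the treatise; it assumes a hypothetical system with four independent layers of defense, each blocking a fault trajectory with the stated probability, and shows why overt accidents, which require every layer to fail at once, are far rarer than the small failures that feed them.

```python
import random

# Hypothetical sketch (not from the treatise): a small, apparently innocuous
# failure becomes an overt accident only if it slips past every defense layer.
N_FAULTS = 1_000_000                  # fault opportunities
BLOCK_PROBS = [0.9, 0.9, 0.8, 0.7]    # assumed per-layer blocking probabilities

def penetrates_all_defenses(block_probs):
    """True only if the fault trajectory gets past every defense layer."""
    return all(random.random() > p for p in block_probs)

accidents = sum(penetrates_all_defenses(BLOCK_PROBS) for _ in range(N_FAULTS))

print(f"fault opportunities: {N_FAULTS}")
print(f"overt accidents:     {accidents}")   # expected near 1e6 * 0.1 * 0.1 * 0.2 * 0.3 = 600
print(f"accident rate:       {accidents / N_FAULTS:.6f}")
```

With these assumed numbers only a few hundred of a million fault opportunities become overt accidents; real defenses are neither independent nor static, so the sketch is illustrative only.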
4) Complex systems contain changing mixtures of failures latent within them.
The complexity of these systems makes it impossible for them to run without multiple flaws being
present. Because these are individually insufficient to cause failure they are regarded as minor
factors during operations. Eradication of all latent failures is limited primarily by economic cost
but also because it is difficult before the fact to see how such failures might contribute to an
accident. The failures change constantly because of changing technology, work organization, and
efforts to eradicate failures.
5) Complex systems run in degraded mode.
A corollary to the preceding point is that complex systems run as broken systems. The system
continues to function because it contains so many redundancies and because people can make it
function, despite the presence of many flaws. After-accident reviews nearly always note that the
system has a history of prior ‘proto-accidents’ that nearly generated catastrophe. Arguments that
these degraded conditions should have been recognized before the overt accident are usually
predicated on naïve notions of system performance. System operations are dynamic, with
components (organizational, human, technical) failing and being replaced continuously.
6) Catastrophe is always just around the corner.
Complex systems possess potential for catastrophic failure. Human practitioners are nearly always
in close physical and temporal proximity to these potential failures – disaster can occur at any time
and in nearly any place. The potential for catastrophic outcome is a hallmark of complex systems.
It is impossible to eliminate the potential for such catastrophic failure; the potential for such
failure is always present by the system’s own nature.
7) Post-accident attribution of accidents to a ‘root cause’ is fundamentally wrong.
Because overt failure requires multiple faults, there is no isolated ‘cause’ of an accident. There are multiple contributors to accidents. Each of these is necessarily insufficient in itself to create an accident. Only jointly are these causes sufficient to create an accident. Indeed, it is the linking of these causes together that creates the circumstances required for the accident. Thus, no isolation of the ‘root cause’ of an accident is possible. Evaluations based on such reasoning as ‘root cause’ do not reflect a technical understanding of the nature of failure but rather the social, cultural need to blame specific, localized forces or events for outcomes.²

² Anthropological field research provides the clearest demonstration of the social construction of the notion of ‘cause’ (cf. Goldman L (1993), The Culture of Coincidence: Accident and Absolute Liability in Huli, New York: Clarendon Press; and also Tasca L (1990), The Social Construction of Human Error, unpublished doctoral dissertation, Department of Sociology, State University of New York at Stony Brook).
8) Hindsight biases post-accident assessments of human performance.
Knowledge of the outcome makes it seem that events leading to the outcome should have appeared more salient to practitioners at the time than was actually the case. This means that ex post facto accident analysis of human performance is inaccurate. Outcome knowledge poisons the ability of after-accident observers to recreate the view that practitioners had of those same factors before the accident. It seems that practitioners “should have known” that the factors would “inevitably” lead to an accident. Hindsight bias remains the primary obstacle to accident investigation, especially when expert human performance is involved.³

³ This is not a feature of medical judgements or technical ones, but rather of all human cognition about past events and their causes.
9) Human operators have dual roles: as producers & as defenders against failure.
The system practitioners operate the system in order to produce its desired product and also work
to forestall accidents. This dynamic quality of system operation, the balancing of demands for
production against the possibility of incipient failure is unavoidable. Outsiders rarely acknowledge
the duality of this role. In non-accident filled times, the production role is emphasized. After
accidents, the defense against failure role is emphasized. At either time, the outsider’s view
misapprehends the operator’s constant, simultaneous engagement with both roles.
10) All practitioner actions are gambles.
After accidents, the overt failure often appears to have been inevitable and the practitioner’s
actions as blunders or deliberate willful disregard of certain impending failure. But all practitioner
actions are actually gambles, that is, acts that take place in the face of uncertain outcomes. The
degree of uncertainty may change from moment to moment. That practitioner actions are gambles
appears clear after accidents; in general, post hoc analysis regards these gambles as poor ones. But the converse, that successful outcomes are also the result of gambles, is not widely appreciated.
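A toy simulation, with assumed numbers and not drawn from the original text, makes the asymmetry concrete: the same act, taken under the same uncertainty, is repeated many times and then sorted by how it turned out, even though the gamble was identical in every case.

```python
import random

# Hypothetical sketch: one and the same practitioner action, modeled as a
# gamble that succeeds with fixed (assumed) odds.
P_SUCCESS = 0.98          # assumed probability that the act turns out well
TRIALS = 100_000          # the identical gamble, taken many times

outcomes = [random.random() < P_SUCCESS for _ in range(TRIALS)]
successes = sum(outcomes)
failures = TRIALS - successes

# Post hoc review sorts the cases by outcome, although the decision, and its
# expected value, was the same in every trial.
print(f"identical gambles taken:       {TRIALS}")
print(f"later judged as good practice: {successes}")
print(f"later judged as 'blunders':    {failures}")
```

Judging the gamble by its outcome, as post hoc analysis tends to do, therefore says little about the quality of the decision itself.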
11) Actions at the sharp end resolve all ambiguity.
Organizations are ambiguous, often intentionally, about the relationship between production
targets, efficient use of resources, economy and costs of operations, and acceptable risks of low
and high consequence accidents. All ambiguity is resolved by actions of practitioners at the sharp
end of the system. After an accident, practitioner actions may be regarded as ‘errors’ or ‘violations’
but these evaluations are heavily biased by hindsight and ignore the other driving forces,
especially production pressure.
12) Human practitioners are the adaptable element of complex systems.
Practitioners and first line management actively adapt the system to maximize production and
minimize accidents. These adaptations often occur on a moment by moment basis. Some of these
adaptations include: (1) Restructuring the system in order to reduce exposure of vulnerable parts to
failure. (2) Concentrating critical resources in areas of expected high demand. (3) Providing
pathways for retreat or recovery from expected and unexpected faults. (4) Establishing means for
early detection of changed system performance in order to allow graceful cutbacks in production
or other means of increasing resiliency.
13) Human expertise in complex systems is constantly changing.
Complex systems require substantial human expertise in their operation and management. This
expertise changes in character as technology changes but it also changes because of the need to
replace experts who leave. In every case, training and refinement of skill and expertise is one part
of the function of the system itself. At any moment, therefore, a given complex system will
contain practitioners and trainees with varying degrees of expertise. Critical issues related to
expertise arise from (1) the need to use scarce expertise as a resource for the most difficult or
demanding production needs and (2) the need to develop expertise for future use.
14) Change introduces new forms of failure.
The low rate of overt accidents in reliable systems may encourage changes, especially the use of
new technology, to decrease the number of low consequence but high frequency failures. These
changes may actually create opportunities for new, low frequency but high consequence
failures. When new technologies are used to eliminate well understood system failures or to gain
high precision performance they often introduce new pathways to large scale, catastrophic failures.
Not uncommonly, these new, rare catastrophes have even greater impact than those eliminated by
the new technology. These new forms of failure are difficult to see before the fact; attention is paid
mostly to the putative beneficial characteristics of the changes. Because these new, high
consequence accidents occur at a low rate, multiple system changes may occur before an accident,
making it hard to see the contribution of technology to the failure.
15) Views of ‘cause’ limit the effectiveness of defenses against future events.
Post-accident remedies for “human error” are usually predicated on obstructing activities that can
“cause” accidents. These end-of-the-chain measures do little to reduce the likelihood of further
accidents. In fact, the likelihood of an identical accident is already extraordinarily low because the
pattern of latent failures changes constantly. Instead of increasing safety, post-accident remedies
usually increase the coupling and complexity of the system. This increases the potential number
of latent failures and also makes the detection and blocking of accident trajectories more difficult.
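A back-of-the-envelope count, offered as a hypothetical illustration rather than an argument made in the text, suggests why added coupling cuts both ways: the number of potential pairwise interactions among components, each a place where a latent failure can hide or an accident trajectory can pass unnoticed, grows quadratically as components are added.

```python
def pairwise_interactions(n_components: int) -> int:
    """Potential pairwise interactions among n components (n choose 2)."""
    return n_components * (n_components - 1) // 2

# Adding a few post-accident 'safety' components enlarges the space in which
# latent failures can hide much faster than it enlarges the component count.
for n in (10, 12, 15, 20):
    print(f"{n:2d} components -> {pairwise_interactions(n):3d} potential pairwise interactions")
```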
16) Safety is a characteristic of systems and not of their components.
Safety is an emergent property of systems; it does not reside in a person, device or department of
an organization or system. Safety cannot be purchased or manufactured; it is not a feature that is
separate from the other components of the system. This means that safety cannot be manipulated
like a feedstock or raw material. The state of safety in any system is always dynamic; continuous
systemic change ensures that hazard and its management are constantly changing.
17) People continuously create safety.
Failure free operations are the result of activities of people who work to keep the system within
the boundaries of tolerable performance. These activities are, for the most part, part of normal
operations and superficially straightforward. But because system operations are never trouble free,
human practitioner adaptations to changing conditions actually create safety from moment to
moment. These adaptations often amount to just the selection of a well-rehearsed routine from a
store of available responses; sometimes, however, the adaptations are novel combinations or de
novo creations of new approaches.
18) Failure free operations require experience with failure.
Recognizing hazard and successfully manipulating system operations to remain inside the
tolerable performance boundaries requires intimate contact with failure. More robust system
performance is likely to arise in systems where operators can discern the “edge of the envelope”.
This is where system performance begins to deteriorate, becomes difficult to predict, or cannot be
readily recovered. In intrinsically hazardous systems, operators are expected to encounter and
appreciate hazards in ways that lead to overall performance that is desirable. Improved safety
depends on providing operators with calibrated views of the hazards. It also depends on providing
calibration about how their actions move system performance towards or away from the edge of
the envelope.
Other materials:
Cook & Woods (1994), "Operating at the Sharp End: The Complexity of Human Error," in MS Bogner, ed., Human Error in Medicine, Hillsdale, NJ, pp. 255-310.
Cook, Woods, & Miller (1998), A Tale of Two Stories: Contrasting Views of Patient Safety, Chicago, IL: NPSF (available as a PDF file on the NPSF web site at www.npsf.org).
Woods & Cook (1999), "Perspectives on Human Error: Hindsight Biases and Local Rationality," in Durso, Nickerson, et al., eds., Handbook of Applied Cognition, New York: Wiley, pp. 141-171.
Woods, Dekker, Cook, Johannesen, & Sarter (2010), Behind Human Error, 2nd Edition, CRC Press.
2018 note: substance of the body text is unchanged from 1998 version; typos corrected where recognized;
Other Materials section updated to reflect current conditions.
Limited permission for non-commercial use is granted provided the document is reproduced in full with
copyright notice and without any alteration. For other uses, contact the author.
Copyright © 1998 by Richard I. Cook, MD, all rights reserved Revision G (2018.06.01)