Epilogue: Resilience Engineering Precepts
Erik Hollnagel
David D. Woods
Safety is Not a System Property
One of the recurrent themes of this book is that safety is something a
system or an organisation does, rather than something a system or an
organisation has. In other words, it is not a system property that, once
put in place, will remain. It is rather a characteristic of how
a system performs. This creates the dilemma that safety is shown more
by the absence of certain events – namely accidents – than by the
presence of something. Indeed, the occurrence of an unwanted event
need not mean that safety as such has failed, but could equally well be
due to the fact that safety is never complete or absolute.
In consequence of this, resilience engineering abandons the search
for safety as a property, whether defined through adherence to standard
rules, in error taxonomies, or in ‘human error’ counts. By doing so it
acknowledges the danger of the reification fallacy, i.e., the tendency to
convert a complex process or abstract concept into a single entity or
thing in itself (Gould, 1981, p. 24). Seeing resilience as a quality of
functioning has two important consequences.
• We can only measure the potential for resilience but not resilience
itself. Safety has often been expressed by means of reliability,
measured as the probability that a given function or component
would perform as required under specific circumstances. It is, however, not enough
that systems are reliable and that the probability of failure is below
a certain value (cf. Chapter 16); they must also be resilient and have
the ability to recover from irregular variations, disruptions and
degradation of expected working conditions.
• Resilience cannot be engineered simply by introducing more
procedures, safeguards, and barriers. Resilience engineering instead
requires a continuous monitoring of system performance, of how
things are done. In this respect resilience is tantamount to coping
with complexity (Hollnagel & Woods, 2005), and to the ability to
retain control.
Resilience as a Form of Control
A system is in control if it is able to minimise or eliminate unwanted
variability, either in its own performance, in the environment, or in
both. The link between loss of control and the occurrence of
unexpected events is so tight that a preponderance of the latter in
practice is a signature of the former. Unexpected events are therefore
often seen as a consequence of lost control. The loss of control is
nevertheless not a necessary condition for unexpected events to occur.
They may be due to other factors, causes and developments outside the
boundaries of the system.
An unexpected event can also be a precipitating factor for loss of
control and in this respect the relation to resilience is interesting.
Knowing that control has been lost is of less value than knowing when
control is going to be lost, i.e., when unexpected events are likely. In
fact, according to the definition of resilience (Chapter 1), the
fundamental characteristic of a resilient organisation is that it does not
lose control of what it does, but is able to continue and rebound
(Chapter 13).
In order to be in control it is necessary to know what has happened
(the past), what happens (the present) and what may happen (the
future), as well as knowing what to do and having the required
resources to do it. If we consider joint cognitive systems in general,
ranging from single individuals interacting with simple machines, such
as a driver in a car, to groups engaged in complex collaborative
undertakings, such as a team of doctors and nurses in the operating
room, it soon becomes evident that a number of common conditions
characterise how well they perform and when and how they lose
control, regardless of domains. These conditions are lack of time, lack of
knowledge, lack of competence, and lack of resources (Hollnagel & Woods,
2005, pp. 75-78).
Lack of time may come about for a number of reasons such as
degraded functionality, inadequate or overoptimistic planning, undue
demands from higher echelons or from the outside, etc. Lack of time is,
however, quite often a consequence of lack of foresight since that
pushes the system into a mode of reactive responding. Knowing what
happens and being able to respond are not by themselves sufficient to
ensure control, since a system without anticipation is limited to purely
reactive behaviour. That inevitably incurs a loss of time, both because
the response must come after the fact and therefore be compensatory,
and because the resources to respond may not always be ready when
needed but first have to be marshalled. As a consequence, a system
that relies on feedback alone will in most cases sooner or
later fall behind the pace of events and therefore lose control.
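To make the argument concrete, here is a minimal simulation sketch (our illustration, not part of the chapter). It assumes a discrete-time process with a steadily growing disturbance: the purely reactive controller can only compensate for the deviation it observed one step earlier, while the anticipatory one also predicts the next demand.

```python
def disturbance(t: int) -> float:
    """External variability the system must absorb: a slow ramp."""
    return 0.5 * t

def run(anticipate: bool, steps: int = 10) -> float:
    """Return the accumulated deviation over a short episode."""
    state = 0.0        # deviation from the desired condition
    correction = 0.0   # response prepared in the previous step
    total = 0.0
    for t in range(steps):
        # The response arrives after the fact, so it is applied one step late.
        state += disturbance(t) - correction
        total += abs(state)
        correction = state  # feedback: compensate for what was just seen
        if anticipate:
            correction += disturbance(t + 1)  # feedforward: expected demand
    return total

print(f"feedback only:     {run(anticipate=False):6.1f}")
print(f"with anticipation: {run(anticipate=True):6.1f}")
```

Running the sketch shows the reactive system's accumulated deviation growing with the length of the episode, while the anticipatory one stays near zero: responding after the fact is always one step behind the pace of events.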
Knowledge is obviously important both for knowing what to
expect (anticipation) and for knowing what to look for or where to
focus next (attention, perception). The encapsulated experience is
sometimes referred to as the system’s ‘model of the world’ and must as
such be dynamic rather than static. Knowledge is, however, more than
just experience; it also comprises the ability to go beyond experience,
to expect the unexpected and to look for more than just the obvious.
This ability, technically known as requisite imagination (Westrum, 1991;
Adamski & Westrum, 2003), is a sine qua non for resilience.
Competence and resources are both important for the system’s
ability to respond rationally (where 'rationally' is not used in the
traditional, normative meaning, but rather as the quality of being
anti-entropic, cf. Hollnagel, 2005). Competence refers to knowing what
to do and knowing how to do it, whereas resources refer to the
ability to do it. That the latter are essential is obvious from the fact that
control is easily lost if the resources needed to implement the intended
response are missing. This is, for instance, a common condition in the
face of natural disasters such as wildfires, earthquakes, and pandemics.
Figure E.1 illustrates three qualities that a system must have to be
able to remain in control, and therefore to be resilient, with time as a
fourth, dependent quality. The three main qualities are not linked in the
sense that anticipation precedes attention, which in turn precedes
response. Although this ordering in some sense will be present for any
specific instance that is described or analysed, the whole point about
resilience is that these qualities must be exercised continuously. The
system must constantly be watchful and prepared to respond.
Additionally, it must constantly update its knowledge, competence and
resources by learning from successes and failures – its own as well as
those of others.
Figure E.1: Required qualities of a resilient system
It is interesting to note that Diamond (2005) in his book on how
societies collapse and go under, identifies three ‘stops on the road to
failure’ (p. 419). These are: (1) the failure to anticipate a problem before
it has arrived, (2) the failure to perceive a problem that has actually
arrived, and (3) the failure to attempt to solve a problem once it has
been perceived (rational bad behaviour). A society that collapses is
arguably an extreme case of lack of resilience, yet it is probably no
coincidence that we find the positive version of exactly the same
characteristics in the general descriptions of what a system – or even an
individual – needs to remain in control. A resilient system must have
the ability to anticipate, perceive, and respond. Resilience engineering
must therefore address the principles and methods by which these
qualities can be brought about.
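One way to summarise these qualities is as an interface whose operations are exercised on every cycle. The sketch below is hypothetical (all names are ours, not the authors'): anticipation, perception and response run continuously, with learning updating the system's 'model of the world'.

```python
from abc import ABC, abstractmethod

class ResilientSystem(ABC):
    """Hypothetical rendering of the qualities in Figure E.1."""

    @abstractmethod
    def anticipate(self) -> list[str]:
        """Knowing what to expect: candidate future disruptions."""

    @abstractmethod
    def perceive(self, watchlist: list[str]) -> list[str]:
        """Knowing what to look for: disturbances actually present."""

    @abstractmethod
    def respond(self, event: str) -> None:
        """Knowing what to do, and having the resources to do it."""

    @abstractmethod
    def learn(self, event: str) -> None:
        """Update knowledge, competence and resources from outcomes."""

    def operate(self) -> None:
        # Not a pipeline of stages: every quality is exercised on every
        # cycle, so the system is constantly watchful and prepared.
        while True:
            watchlist = self.anticipate()            # what may happen
            for event in self.perceive(watchlist):   # what happens
                self.respond(event)                  # what to do
                self.learn(event)                    # what has happened
```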
Readiness for Action
It is a depressing fact that examples of system failures are never hard to
find. One such case, which fortunately left no one harmed, occurred
during the editing of this book. As everybody remembers, a magnitude
9.3 earthquake occurred on the morning of December 26, 2004, about
240 kilometres south of Sumatra. This earthquake triggered a tsunami
that swept across the Indian Ocean killing at least 280,000 people. One
predictable consequence of this most tragic disaster was that coastal
regions around the world became acutely aware of the tsunami risk and
therefore of the need to implement well-functioning early warning
systems. In these cases there is little doubt about what to expect, what
to look for, and what to do. So when a magnitude 7.2 earthquake
occurred on June 14, 2005, about 140 kilometres off the town of
Eureka in California, the tsunami warning system was ready and went
into action.
As it happened, not one but two tsunami warning centres reacted.
The first warning, covering the US and Canadian west coast, came from
a centre in Alaska. Three minutes later a second warning was issued by
a centre in Hawaii. The second warning said that there was no risk of
a tsunami, but excluded the west coast north of California from that
assessment. Rescue workers, missing this small but significant detail, were
predictably confused (Biever & Hecht, 2005, p. 24).
Tsunami warnings are broadcast via radio by the US National
Oceanic and Atmospheric Administration (NOAA). Unfortunately,
some locations cannot receive the NOAA radio signals because they are
blocked by mountains. They are therefore contacted by phone from
Seattle. On the day in question, however, a phone line was down so
that the message did not get through, effectively leaving some areas
without warning. This glitch was, however, not noticed. As it happened,
the earthquake was of a type that could not give rise to a tsunami, and
the warning was cancelled after one hour.
This example illustrates a system that was not resilient, despite
being able to detect the risk in time. While precautions had been taken
and procedures put in place, there was no awareness of whether they
actually worked and no understanding of what the actual conditions were.
The specific shortcoming was one of communication, issuing
inconsistent warnings and lacking feedback, and the consequence was a
partial lack of readiness to respond. Using the terminology proposed in
Chapter 21, the communication failure meant that some districts did
not go into the required state of high alert, as a preparation for an
evacuation. While the tsunami warning system was designed to look for
specific factors in the environment, it was not designed to look at itself,
to ensure that the ‘internal’ functions worked. The system was designed
to be safe by means of all the technology and procedures that were put
in place, but it was not designed to be resilient.
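The missing 'look at itself' function can be sketched as a delivery check. The following is a hypothetical design, not a description of the actual NOAA system: every warning channel must acknowledge delivery, so that a dead phone line is noticed and escalated rather than silently leaving some areas without warning.

```python
from dataclasses import dataclass, field

@dataclass
class Channel:
    """One route by which a warning reaches a district."""
    name: str
    working: bool = True    # simulated link state; a real system must probe it
    delivered: bool = False

    def send(self, message: str) -> None:
        # Stand-in for a broadcast plus an acknowledgement handshake.
        self.delivered = self.working

@dataclass
class WarningDispatcher:
    channels: list[Channel] = field(default_factory=list)

    def broadcast(self, message: str) -> list[str]:
        """Send on every channel; return the channels that never confirmed."""
        for channel in self.channels:
            channel.send(message)
        return [c.name for c in self.channels if not c.delivered]

dispatcher = WarningDispatcher([
    Channel("NOAA weather radio"),
    Channel("phone relay from Seattle", working=False),  # the line that was down
])

unconfirmed = dispatcher.broadcast("Tsunami warning: move to high ground")
if unconfirmed:
    # The resilient addition: the system monitors its own functioning and
    # escalates, rather than assuming that every message got through.
    print(f"ALERT: no delivery confirmation from {unconfirmed}")
```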
Why Things Go Wrong
It is a universal experience that things sooner or later will go wrong,[13]
and fields such as risk analysis and human reliability assessment have
developed a plethora of methods to help us predict when and how it
may happen. From the point of view of resilience engineering it is,
however, at least as important to understand why things go wrong. One
expression of this is found in the several accident theories that have
been proposed over the years (e.g., Hollnagel, 2004), not least the many
theories of ‘human error’ and organisational failure. Most such efforts
have been preoccupied with the problems found in technical or socio-
technical systems. Yet there have also been attempts to look at the
larger issues, most notably Robert Merton’s lucid analysis of why social
actions often have unanticipated consequences (Merton, 1936).

[13] This is often expressed in terms of Murphy's law, the common version of which is that 'everything that can go wrong, will'. A much earlier version is Spode's law, which says that 'if something can go wrong, it will.' It is named after the English potter Josiah Spode (1733-97), who became famous for perfecting the transfer printing process and for developing fine bone china – but presumably not without many failed attempts on the way.
It is almost trivial to say that we need a model, or a frame of
reference, to be able to understand issues such as safety and resilience
and to think about how safety can be ensured, maintained, and
improved. A model helps us to determine which information to look
for and brings some kind of order into chaos by providing the means
by which relationships can be explained. This obviously applies not
only to industrial safety, but to every human endeavour and industry.
To do so, the model must in practice fulfil two requirements. First, that
it provides an explanation or brings about an understanding of an event
such that effective mitigating actions can be devised. Second, that it can
be used with a reasonable investment of effort – intellectual effort, as
well as time and resources. A model that is cumbersome and costly to
use will, from an academic point of view, be at a disadvantage from the
very start, even if it provides a better explanation.[14] The trick is
therefore to find a model that at the same time is so simple that it can
be used without engendering problems or requiring too much
specialised knowledge, yet powerful enough to go beneath the often
deceptive surface descriptions.

[14] 'Better' is, of course, a rather dangerous term to use since it implies that some objective criterion or standard is available. Although there is no truth to be used as a point of reference, it is possible to show that one explanation – under given conditions – may be better than another, e.g., in providing more effective countermeasures. Changes are, however, never contemplated sub specie aeternitatis but are always subject to often very mundane or pecuniary considerations.
The problem with any powerful model is that it very quickly
becomes ‘second nature’, which means that we no longer realise the
simplifications it embodies. This should, however, not lead to the
conclusion that we must give up on models and try to describe reality
as it really is, since that is a philosophically naïve notion. The
consequence is rather that we should acknowledge the simplifications
that the models bring, and carefully weigh advantages against
disadvantages so that a choice of model is made knowingly.
Several models have been mentioned in the chapters in this book.
The most important models in the past have been the Domino model
and the Swiss cheese model (Chapter 1). Both are easy to comprehend
and have been immensely helpful in improving the understanding of
accidents. Yet their simplicity also means that some aspects cannot be
easily described – or not described at all – and that explanations in terms of
the models therefore may be incomplete. (Strictly speaking, the two
models are metaphors rather than models. In one, accidents are likened
to a row of dominoes falling, and in the other, to harmful influences
passing through a series of holes aligned.)
In the case of the Domino model, it is clear that the real world has
no domino pieces waiting to fall. There may be precariously poised
systems or subsystems that suddenly may change from a normal to an
abnormal state, but that transition is rarely as simple as a domino
falling. Likewise, the linking or coupling between dominoes is never as
simple as the model shows. Similarly, the Swiss cheese model does not
suggest that we should look for slices of cheese or holes, or measure
the size of holes or movements of slices of cheese. The Swiss cheese
model rather serves to emphasise the importance of latent conditions
and illustrate how these in combination with active failures may lead to
accidents.
The Domino and Swiss cheese models are useful to explain the
abrupt, unexpected onset of accidents, but have problems in accounting
for the gradual loss of safety that may also lead to accidents. In order to
overcome this problem, a model of ‘drift to danger’ has been used, for
example in Chapter 3. Although the metaphor of drift introduces an
important dynamic aspect, it should not be taken literally or as a model,
for the following reasons:
• Since the boundaries or margins only exist in a metaphorical sense
or perhaps as emergent descriptions (Cook & Rasmussen, 2005),
there is really no way in which an organisation can ‘sail close’ to an
area of danger, nor ways in which the ‘distance’ can be measured.
‘Drift’ then refers only to how a series of individual actions or
decisions have larger, combined and longer term impacts on system
properties that are missed or underappreciated.
• The metaphor itself oversimplifies the situation by referring to the
organisation as a whole. There is ample practical experience to
show that some parts of an organisation may be safe while others
may be unsafe. In other words, parts of the organisation may ‘drift’
in different directions. The safety of the organisation can
furthermore not be derived from a linear combination of the parts,
but rather depends on the ways in which they are coupled and how
coordination across these parts is fragmented or synchronized (cf.
Perrow, 1984). This is also the reason why accidents in a very
fundamental sense are non-linear phenomena.
• Finally, there are no external forces that, like the wind, push an
organisation in some direction, or allow the ‘captain’ to steer it clear
of danger. What happens is rather that choices and decisions made
during daily work may have long-term consequences that are not
considered at the time. There can be many reasons for this, such as
the lack of proper ‘conceptual’ tools or a shortage of time.
It is inevitable that organisational practices change as part of daily
work, one simple reason being that the environment is partly
unpredictable, changing, or semi-erratic. Such changes are needed either
for purposes of safety or efficiency, though mostly the latter. Indeed,
the most important factor is probably the need to gain time in order to
prevent control from being lost, as described by the efficiency-
thoroughness trade-off (ETTO; Hollnagel, 2004). There is never
enough time to be sufficiently thorough; finishing an activity in time
may be important for other actions or events, which in turn cannot be
postponed because yet others depend on them, etc. The reality of this
tight coupling is probably best illustrated by the type of industrial action
that consists in ‘working to rule.’ This also provides a powerful
demonstration of how important the everyday trade-offs and shortcuts
are for the normal functioning of a system.
Changed practices to improve efficiency often have long-term
consequences that affect safety, although for one reason or another
they are disregarded when the changes are made. These consequences
are usually latent and therefore only show
themselves after a while. Drift is therefore nothing more than an
accumulated effect of latent consequences, which in turn result from
the trade-off or sacrificing decisions that are required to keep the
system running.
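As a toy illustration of this accumulation (ours, with arbitrary parameters), the sketch below lets a small efficiency-thoroughness sacrifice occur each day. No single sacrifice is alarming, yet the latent consequences add up into drift.

```python
import random

random.seed(1)              # reproducible run; parameters are arbitrary
thoroughness = 1.0          # fraction of checks actually performed
latent_risk = 0.0           # accumulated consequences, invisible day to day

for day in range(1, 251):
    pressure = random.uniform(0.0, 0.02)        # today's demand to gain time
    thoroughness -= pressure * thoroughness     # a small, sensible sacrifice
    latent_risk += (1.0 - thoroughness) * 0.01  # its consequence is latent
    if day % 50 == 0:
        # From inside the system, each day still looks like normal work.
        print(f"day {day:3d}: thoroughness={thoroughness:.2f}, "
              f"accumulated latent risk={latent_risk:.2f}")
```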
A Constant Sense of Unease
Sacrificing decisions take place on both the individual and the
organisational levels – and even at the level of societies. While they are
necessary to cope with a partly unpredictable environment, they
constitute a source of risk when they become entrenched in
institutional or organisational norms. When trade-offs and sacrificing
decisions become habitual, they are usually forgotten. Being alert or
critical incurs a cost that no individual or organisation can sustain
permanently and is therefore only used when necessary. Norms qua
norms are for that reason rarely challenged. Yet it is important for
resilience that norms remain conspicuous, not in the sense that they
must constantly be scrutinised and revised, but in the sense that their
existence is not forgotten and their assumptions not taken for granted.
Resilience requires a constant sense of unease that prevents
complacency. It requires a realistic sense of abilities, of ‘where we are’.
It requires knowledge of what has happened, what happens, and what
will happen, as well as of what to do. A resilient system must be
proactive, flexible, adaptive, and prepared. It must be aware of the
impact of actions, as well as of the failure to take action.
Precepts
The purpose of this book has been to propose resilience engineering as
a step forward from traditional safety engineering techniques – such as
those developed in risk analysis and probabilistic safety assessment
(PSA). Rather than try to force adaptive processes and organisational
factors into these families of measures and methods, resilience
engineering recognises the need to study safety as a process and to
provide new measures, new ways to monitor systems, and new ways to
intervene to improve safety. Thinking in terms of resilience shifts
inquiry to the nature of the ‘surprises’ or types of variability that
challenge control.
• If ‘surprises’ are seen as disturbances, or disrupting events, which
challenge the proper functioning of a process, then inquiry centres
on how to keep a process under control in the face of such
disrupting events, specifically on how to ensure that people do not
exceed given ‘limits.’
• If ‘surprises’ are seen as uncertainty about the future, then inquiry
centres on developing ways to improve the ability to anticipate and
respond when so challenged.
• If ‘surprises’ are seen as recognition of the need constantly to
update definitions of the difference between success and failure,
then inquiry centres on the kinds of variations which our systems
should be able to handle and ways constantly to test the system’s
ability to handle these classes of variations.
• If ‘surprises’ are seen as recognition that models and plans are likely
to be incomplete or wrong, despite our best efforts, then inquiry
centres on the search for the boundaries of our assessments in
order to learn and revise.
Resilience engineering entails a shift from an over-reliance on
analysis techniques to adaptive and co-adaptive models and measures as
the basis for safety management. Just as it acknowledges and tries to
avoid the risks of reification (cf. above), it also acknowledges and tries
to avoid the risks of oversimplifications, such as:
• working from static snapshots, rather than recognising that safety
emerges from dynamic processes;
• looking for separable or independent factors, rather than examining
the interactions across factors; and
• modelling accidents as chains of causality, rather than as the result
of tight couplings and functional resonance.
It is fundamental for resilience engineering to monitor and learn
from the gap between work as imagined and work as practised.
Anything that obscures this gap will make it impossible for the
organisation to calibrate its understanding or model of itself and
thereby undermine processes of learning and improvement.
Understanding what produces the gap can drive learning and
improvement and prevent dependence on local workarounds or
conformity with distant policies. There was universal agreement among
the symposium attendees that previous research supports the above as a
critical first principle. The practical problem is how to monitor this gap
and how to channel what is learned into organisational practice.
The Way Ahead
This book boldly asserts that sufficient progress has been made on
resilience as an alternative safety management paradigm to begin to
deploy that knowledge in the form of engineering management
techniques. The essential constituents of resilience engineering are
already at hand. Since the beginning of the 1990s there has been a
gradual evolution of the principles for organisational resilience and of
the understanding of the factors that determine human and
organisational performance. As a result, there is an appreciable basis for
how to incorporate human and organisational risk in life cycle systems
engineering tools and how to build knowledge management tools that
proactively capture how human and organisational factors affect risk.
While additional studies can continue to document the role played
by adaptive processes for how safety is created in complex systems, this
book marks the beginning of a transition in resilience engineering from
research questions to engineering management tools. Such tools are
needed to improve the effectiveness and safety of organisations
confronted by high hazard and high performance demands. In
particular, we believe that further advances in the resilience paradigm
should occur through deploying the new measures and techniques in
partnership with management for actual hazardous processes. Such
projects will have the dual goals of simultaneously advancing the
research base on resilience and tuning practical measurement and
management tools to function more effectively in actual organisational
decision-making.