Content uploaded by Anders Sandberg

Author content

All content in this area was uploaded by Anders Sandberg

Content may be subject to copyright.

1

Probing the Improbable: Methodological Challenges for

Risks with Low Probabilities and High Stakes

Toby Ord, Rafaela Hillerbrand, Anders Sandberg

*

Some risks have extremely high stakes. For example, a worldwide pandemic or

asteroid impact could potentially kill more than a billion people. Comfortingly,

scientific calculations often put very low probabilities on the occurrence of such

catastrophes. In this paper, we argue that there are important new

methodological problems which arise when assessing global catastrophic risks

and we focus on a problem regarding probability estimation. When an expert

provides a calculation of the probability of an outcome, they are really providing

the probability of the outcome occurring, given that their argument is

watertight. However, their argument may fail for a number of reasons such as a

flaw in the underlying theory, a flaw in the modeling of the problem, or a

mistake in the calculations. If the probability estimate given by an argument is

dwarfed by the chance that the argument itself is flawed, then the estimate is

suspect. We develop this idea formally, explaining how it differs from the

related distinctions of model and parameter uncertainty. Using the risk estimates

from the Large Hadron Collider as a test case, we show how serious the problem

can be when it comes to catastrophic risks and how best to address it.

1. Introduction

Large asteroid impacts are highly unlikely events.

1

Nonetheless, governments spend

large sums on assessing the associated risks. It is the high stakes that make these

otherwise rare events worth examining. Assessing a risk involves consideration of

both the stakes involved and the likelihood of the hazard occurring. If a risk

threatens the lives of a great many people it is not only rational but morally

imperative to examine the risk in some detail and to see what we can do to reduce it.

This paper focuses on low-probability high-stakes risks. In section 2, we show that

the probability estimates in scientific analysis cannot be equated with the likelihood

of these events occurring. Instead of the probability of the event occurring, scientific

analysis gives the event’s probability conditioned on the given argument being

sound. Though this is the case in all probability estimates, we show how it becomes

crucial when the estimated probabilities are smaller than a certain threshold.

To proceed, we need to know something about the reliability of the argument. To do

so, risk analysis commonly falls back on the distinction between model and

parameter uncertainty. We argue that this dichotomy is not well suited for

*

Future of Humanity Institute, University of Oxford.

1

Experts estimate the annual probability as approximately one in a billion (Near-Earth Object

Science Definition Team 2003).

2

incorporating information about the reliability of the theories involved in the risk

assessment. Furthermore the distinction does not account for mistakes made

unknowingly. In section 3, we therefore propose a three-fold distinction between an

argument’s theory, its model, and its calculations. While explaining this distinction

in more detail, we illustrate it with historic examples of errors in each of the three

areas. We indicate how specific risk assessment can make use of the proposed

theory-model-calculation distinction in order to evaluate the reliability of the given

argument and thus improve the reliability of their probability estimate for rare

events.

Recently concerns have been raised that high-energy experiments in particle physics,

such as the RHIC (Relativistic Heavy Ion Collider) at Brookhaven National

Laboratory or the LHC (Large Hadron Collider) at CERN, Geneva, may threaten

humanity. If these fears are justified, these experiments pose a risk to humanity that

can be avoided by simply not turning on the experiment. In section 4, we use the

methods of this paper to address the current debate on the safety of experiments

within particle physics. We evaluate current reports in the light of our findings and

give suggestions for future research.

The final section brings the debate back to the general issue of assessing low-

probability risk. We stress that the findings in this paper are not to be interpreted as

an argument for anti-intellectualism, but rather as arguments for making the noisy

and fallible nature of scientific and technical research subject to intellectual

reasoning, especially in situations where the probabilities are very low and the stakes

very high.

2. Probability Estimates

Suppose you read a report which examines a potentially catastrophic risk and

concludes that the probability of catastrophe is one in a billion. What probability

should you assign to the catastrophe occurring? We argue that direct use of the

report’s estimate of one in a billion is naïve. This is because the report’s authors are

not infallible and their argument might have a hidden flaw. What the report has told

us is not the probability of the catastrophe occurring, but the probability of the

catastrophe occurring given that the included argument is sound. Even if the argument

looks watertight, the chance that it contains a critical flaw may well be much larger

than one in a billion. After all, in a sample of a billion apparently watertight

arguments you are likely to see many that have hidden flaws. Our best estimate of

the probability of catastrophe may thus end up noticeably higher than the report’s

estimate.

2

Let us use the following notation:

2

Scientific arguments are also sometimes erroneous due to deliberate fraud, however we

shall not address this particular concern in this paper.

3

X = the catastrophe occurs,

A = the argument is sound.

While we are actually interested in P(X), the report provides us only with an estimate

of P(X|A), since it can’t take into account the possibility that it is in error.

3

From the

axioms of probability theory, we know that P(X ) is related to P(X|A)!by the

following formula:

(1) P(X)!=!P(X|A)!P(A)!+!P(X|¬A)!P(¬A) .

To use this formula to derive P(X) we would require estimates for the probability

that the argument is sound, P(A), and the probability of the catastrophe occurring

given that the argument is unsound, P(X|¬A). We are highly unlikely to be able to

acquire accurate values for these probabilities in practice but we shall see that even

crude estimates are enough to change the way we look at certain risk calculations.

A special case, which occurs quite frequently, is for reports to claim that X is

completely impossible. However, this just tells us that X is impossible given that all

our current beliefs are correct, i.e. that P(X|A)!=!0. By equation (1) we can see that

this is entirely consistent with P(X)!>!0, as the argument may be flawed.

Figure 1 is a simple graphical representation of our main point. The square on the

left represents the space of probabilities as described in the scientific report, where

the black area represents the catastrophe occurring and the white area represents it

not occurring. The normalized vertical axis denotes the probabilities for the event

occurring and not occurring. This representation ignores the possibility of the

argument being unsound. To accommodate this possibility, we can revise it in the

form of the square on the right. The black and white areas have shrunk in proportion

to the probability that the argument is sound and a new grey area represents the

possibility that the argument is unsound. Now the horizontal axis is also normalized

and represents the probability that the argument is sound.

3

An argument can take into account the possibility that a certain sub-argument is in error. For

example, it could offer two alternative sub-arguments to prove the same point. We encourage

such practice and look at an example in section 4. However, no argument can fully take into

account the possibility that it is itself is flawed — this would require an additional higher-

level argument.

4

Figure 1: The left panel depicts a report’s view on the probability of an event

occurring. The black area represents the chance of the event occurring, the white area

represents it not occurring. The right hand panel is the more accurate picture, taking

into account the possibility that the argument is flawed and that we thus face an grey

area containing an unknown amount of risk.

To continue our example, let us suppose that the argument made in the report looks

very solid, and that our best estimate of the probability that it is flawed is one in a

thousand, (P(¬A)!=!10

-3

). The other unknown term in equation!(1), P(X|¬A), is

generally even more difficult to evaluate, but lets suppose that in the current

example, we think it highly unlikely that the event will occur even if the argument is

not sound, and that we also treat this probability as one in a thousand. Equation (1)

tells us that the probability of catastrophe would then be just over one in a million —

an estimate which is a thousand times higher than that in the report itself. This

reflects the fact that if the catastrophe were to actually occur, it is much more likely

that this was because there was a flaw in the report’s argument than that a one in a

billion event took place.

Flawed arguments are not rare. One way to estimate the frequency of major flaws in

academic papers is to look at the proportion which are formally retracted after

publication. While some retractions are due to misconduct, most are due to

unintentional errors.

4

Using the MEDLINE database

5

(Cokol, Iossifov et al. 2007)

found a raw retraction rate of 6.3!!!10

-5

, but used a statistical model to estimate that

the retraction rate would actually be between 0.001 and 0.01 if all journals received

the same level of scrutiny as those in the top tier. This would suggest that P(¬A) >

0.001, making our earlier estimate rather optimistic. We must also remember that an

argument can easily be flawed without warranting retraction. Retraction is only

called for when the underlying flaws are not trivial and are immediately noticeable

by the academic community. The retraction rate for a field would thus provide a

lower bound for the rate of serious flaws. Of course, we must also keep in mind the

possibility that different branches of science may have different retraction rates and

different error rates. In particular, the hard sciences may be less prone to error than

the more applied sciences.

4

Between 1982 and 2002, 62% of retractions were due to unintentional errors rather than

misconduct (Nath, Marcus et al. 2006).

5

A very extensive database of biomedical research articles from over 5,000 journals.

5

It is important to note the particular connection between the present analysis and

high-stakes low-probability risks. While our analysis could be applied to any risk, it

is much more useful for those in this category. For it is only when P(X|A) is very low

that the grey area has a relatively large role to play. If P(X|A) is moderately high,

then the small contribution of the error term is of little significance in the overall

probability estimate, perhaps making the difference between 10% and 10.001% rather

than the difference between 0.001% and 0.002%. The stakes must also be very high to

warrant this additional analysis of the risk, for the adjustment to the estimated

probability will typically be very small in absolute terms. While an additional one in

a million chance of a billion deaths certainly warrants further consideration, an

additional one in a million chance of a house fire does not.

One might object to our approach on the grounds that we have shown only that the

uncertainty is greater than previously acknowledged, but not that the probability of

the event is greater than estimated: the additional uncertainty could just as well

decrease the probability of the event occurring. When applying our approach to

arbitrary examples, this objection would succeed; however in this article, we are

specifically looking at cases where there is an extremely low value of P(X|A), so

practically any value of P(X|¬A) will be higher and will thus drive the combined

probability estimate upwards. The situation is symmetric with regard to extremely

high estimates of P(X|A), where increased uncertainty about the argument will

reduce the probability estimate, the symmetry is broken only by our focus on

arguments which claim that an event is very unlikely.

Another possible objection is that since there is always a nonzero probability of the

argument being flawed, the situation is hopeless: any new argument will be unable

to remove the grey area completely. It is true that the grey area can never be

completely removed, however if a new argument (A

2

) is independent of the previous

argument (A

1

) then the grey area will shrink, for P(¬A

1

,!¬A

2

)!<!P(¬A

1

). This can

allow for significant progress. A small remaining grey area can be acceptable if

P(X|¬A)P(¬A) is estimated to be sufficiently small in comparison to the stakes.

3. Theories, Models and Calculations

The most common way to assess the reliability of an argument is to distinguish

between model and parameter uncertainty and assign reliabilities to these choices.

While this distinction has certainly been of use in many practical cases, it is

unnecessarily crude for the present purpose, failing to account for potential errors in

the paper’s calculations or a failure of the background theory.

In order to account for all possible mistakes in the argument, we look separately at

its theory, its model, and its calculations. The calculations evaluate a concrete model

representing the processes under consideration, e.g. the formation of black holes in a

particle collision, the response of certain climate parameters (such as mean

temperature or precipitation rate) to changes in greenhouse gas concentrations, or

the response of economies to changes in the oil price. These models are mostly

derived from more general theories. In what follows, we do not restrict the term

‘theory’ to well-established and mathematically elaborate theories like

6

electrodynamics, quantum chromodynamics or relativity theory. Rather, theories are

understood to include theoretical background knowledge such as specific research

paradigms or the generally accepted research practice within a field. An example is

the efficient market hypothesis which underlies many models within economics,

such as the Black-Scholes model.

Even incorrect theories and models can be useful, if their deviation from reality is

small enough for the purpose at hand. Hence we consider adequate models or

theories rather than correct ones. For example, we wish to allow that Newtonian

mechanics is an adequate theory in many situations, while recognizing that in some

cases it is clearly inadequate (such as for calculating the electron orbitals). We thus

call a representation of some system adequate if it is able to predict the relevant

system features at the required precision. For example, if climate modellers wish to

determine the implications our greenhouse gas emissions will have on the well-being

of future generations; their model/theory will not be adequate unless it tells them

the changes in the local temperature and precipitation. In contrast, a model might

only need to tell them changes in global temperature and precipitation to be adequate

for answering less sensitive questions. On a theoretical level, much more could be

said about this distinction between adequacy and correctness, but for the purposes of

evaluating the reliability of risk assessment, the explanation above should suffice.

With the following notation:

T = the involved theories are adequate

M = the derived model is adequate

C = the calculations are correct

we break down A in the way indicated above and replace P(X|A) in equation (1) by

P(X|T,M,C) and P(A) by P(T ,M,C ). From the laws of conditional probability it

follows that:

(2) P(T,M,C) = P(T)!P(M|T)!P(C|M,T)

We may assume C to be independent of M and T, as the correctness of a calculation is

independent of whether the theoretical and model assumptions underpinning it

were adequate. Given this independence, P(C|M,T) = P(C), so the above equation

can be simplified:

(3) P(T,M,C) = P(T) P(M|T) P(C).

Substituting this back into equation (1), we obtain a more tractable formula for the

probability that the event in question occurs.

We have already made a rough attempt at estimating P(A) from the paper retraction

rates. Estimating P(T), P(M|T) and P(C) is more accurate and somewhat easier,

though still of significant difficulty. Though estimating the various terms in equation

(3) must ultimately be done on a case by case basis, the following elucidation of what

we mean by calculation, model and theory will shed some light on how to pursue

7

such an analysis. By incorporating our threefold distinction, it is straightforward to

apply findings on the reliability of theories from philosophy of science — based, for

example, on probabilistic verification methods (e.g. (Reichenbach 1938)) or

falsifications as in (Hempel 1950) or (Popper 1959). Often, however, the best we can

do is to put some bounds upon them based on the historical record. We thus review

typical sources of error in the three areas.

3.1. Calculation: Analytic and Numeric

Estimating the correctness of the calculation independently from the adequacy of the

model and the theory seems important whenever the mathematics involved is non-

trivial. Most cases where we are able to provide more than purely heuristic and

hand-waving risk assessments are of this sort. Consider climate models evaluating

runaway climate change and risk estimates for the LHC or for asteroid impacts.

When calculations accumulate, even trivial mathematical procedures become error-

prone. A particular difficulty arises due to the division of labour in the sciences:

commonly in modern scientific practice, various steps in a calculation are done by

different individuals who may be in different working groups in different countries.

The Mars Climate Observer spacecraft was lost in 1999 because a piece of control

software from Lockheed Martin used Imperial units instead of the metric units the

interfacing NASA software expected (NASA 1999).

Calculation errors are distressingly common. There are no reliable statistics on the

calculation errors made in risk assessment or, even more broadly, within scientific

papers. However, there is research on errors made in some very simple calculations

that performed in hospitals. Dosing errors give an approximate estimate of how

often mathematical slips occur. Errors in drug charts occur at a rate of 1.2% to 31%

across different studies (Prot, Fontan et al. 2005; Stubbs, Haw et al. 2006; Walsh,

Landrigan et al. 2008), with a median of roughly 5% of administrations. Of these

errors 15-40% were dose errors, giving an overall dose error rate of about 1–2%.

What does this mean for error rates in risk estimation? Since the stakes are high

when it comes to dosing errors, this data represents a serious attempt to get the right

answer in a life or death circumstance. It is likely that the people doing risk

estimation are more reliable at arithmetic than health professionals and have more

time for error correction, but it appears unlikely that they would be many orders of

magnitude more reliable. Hence a chance of 10

-3

for a mistake per simple calculation

does not seem unreasonable. A random sample of papers from Nature and the British

Medical Journal found that roughly 11% of statistical results were flawed, largely due

to rounding and transcription errors (García-Berthou and Alcaraz 2004).

Calculation errors include more than just the ‘simple’ slips which we know from

school, such as confusing units, forgetting a negative square root, or incorrectly

transcribing from the line above. Instead, many mistakes arise here due to numerical

implementation of the analytic mathematical equations. Computer based simulations

and numerical analysis are rarely straightforward. The history of computers contains

a large number of spectacular failures due to small mistakes in hardware or software.

The June 4 1996 explosion of an Ariane 5 rocket was due to a leftover piece of code

triggering a cascade of failures (ESA 1996). Audits of spreadsheets in real-world use

8

find error rates on the order of 88% (Panko 1998). The 1993 Intel Pentium floating

point error affected 3-5 million processors, reducing their numeric reliability and

hence our confidence in anything calculated with them (Nicely 2008). Programming

errors can remain dormant for a long time even in apparently correct code, only to

emerge under extreme conditions. An elementary and widely used binary search

algorithm included in the standard libraries for Java was found after nine years to

contain a bug that emerges only when searching very large lists (Bloch 2006). A

mistake in data-processing led to the retraction of five high-profile protein structure

papers as the handedness of the molecules had become inverted (Miller 2006).

In cases where computational methods are used in modelling, many mistakes cannot

be avoided. Discrete approximations of the often continuous model equations are

used, and in some cases we know that the discrete version is not a good proxy for the

continuous model (Morawetz and Walke 2003). Moreover, numerical evaluations are

often done on a discrete computational grid, with the values inside the meshes being

approximated from the values computed at the grid points. Though we know that

certain extrapolation schemes are more reliable in some cases than others, we are

often unable to exclude the possibility of error, or to even quantify it.

3.2 Ways of modelling and theorizing

Our distinction between model and theory follows the typical use of the terms within

mathematical sciences like physics or economics. Whereas theories are associated

with broad applicability and higher confidence in the correctness of their description,

models are closer to the phenomena. For example, when estimating the probability

of a particular asteroid colliding with the earth, one would use either Newtonian

mechanics or general relativity as a theory for describing the role of gravity. One

could then use this theory in conjunction with observations of the bodies’ positions,

velocities and masses to construct a model, and finally, one could perform a series of

calculations based on this model to estimate the probability of impact. As this shows,

the errors that can be introduced in settling for a specific model include and surpass

those which are sometimes referred to as parameter uncertainty. As well as questions

of the individual parameters (positions, velocities, masses), there are important

questions of detail (can we neglect the inner structure of the involved bodies?), and

breadth (can we focus on the Earth and asteroid only, or do we have to model other

planets, or the Sun?).

6

As can be seen from this example, one way to distinguish theories from models is

that theories are too general to be applied directly to the problem. For any given

theory, there are many ways to apply it to the problem and these ways give rise to

different models. Philosophers of science will note that our theory/model distinction

6

This question of breadth is closely linked to what (Hansson 1996) refers to as demarcation

uncertainty. But demarcation of the problem involves not only the obvious demarcation in

physical space and time, but also questions of the systems to consider, the scales to consider

etc.

9

is in accordance with the non-uniform notion used by (Giere 1999), (Morrison 1998),

(Cartwright 1999), and others, but differs from that of (Suppes 1957).

We should also note that it is quite possible for an argument to involve several

theories or several models. This complicates the analysis and typically provides

additional ways for the argument to be flawed.

7

For example, in estimating the risk

of black hole formation at the LHC, we not only require quantum chromodynamics

(the theory the LHC is built to test), but also relativity and Hawking’s theory of black

hole radiation. In addition to their other roles, modelling assumptions also have to

explain how to glue such different theories together (Hillerbrand and Ghil 2008).

In risk assessment, the systems involved are most often not as well understood as

asteroid impacts. Often, various models exist simultaneously — all known to be

incomplete or incorrect in some way, but difficult to improve upon. Particularly in

these cases, having an expected or desired outcome in mind while setting up a

model, makes one vulnerable to expectation bias: the tendency to reach the desired

answer rather than the correct one. This bias has affected many of science's great

names (Jeng 2006), and in the case of risk assessment, the desire for a ‘positive’

outcome (safety in the case of the advocate or danger in the case of the protestor)

seems a likely cause of bias in modelling.

Figure 2: Our distinctions regarding the ways in which risk assessments can be

flawed.

3.3 Historical examples of Model and Theory Failure

A dramatic example of a model failure was the Castle Bravo nuclear test on March 1

1954. The device achieved 15 megatons of yield instead of the predicted 4-8

megatons. Fallout affected parts of the Marshal Islands and irradiated a Japanese

fishing boat so badly that one fisherman died, causing an international incident

(Nuclear Weapon Archive 2006). Though the designers at Los Alamos National

Laboratories understood the involved theory of alpha decay, their model of the

reactions involved in the explosion was too narrow, for it neglected the decay of one

of the involved particles (lithium-7), which turned out to contribute the bulk of the

7

Additional theories and models can also be deliberately introduced in order to lower the

probability of argument failure, and in section 4, we shall see how this has been done for the

safety assessment of the LHC.

10

explosion’s energy. The Castle Bravo test is also notable for being an example of

model failure in a very serious experiment conducted in the hard sciences and with

known high stakes.

The history of science contains numerous examples of how generally accepted

theories have been overturned by new evidence or understanding, as well as a

plethora of minor theories that persisted for a surprising length of time before being

disproven. Classic examples for the former include the Ptolemaic system, phlogiston

theory and caloric theory; an example for the latter is human chromosome number,

which was systematically miscounted as 48 (rather than 46) and this error persisted

for more than 30 years (Gartler 2006).

As a final example, consider Lord Kelvin’s estimates of the age of the Earth

(Burchfield 1975). They were based on information about the earth’s temperature

and heat conduction, estimating an age of the Earth of between 20 and 40 million

years. These estimates did not take into account radioactive heating, for radioactive

decay was unknown at the time. Once it was shown to generate additional heat the

models were quickly updated. While neglecting radioactivity today would count as a

model failure, in Lord Kelvin’s day it represented a largely unsuspected weakness in

the physical understanding of the Earth and thus amounted to theory failure. This

example makes it clear that the probabilities for the adequacy of model and theory

are not independent of each other, and thus in the most general case we cannot

further decompose equation (3).

4. Applying our analysis to the risks from particle physics research

Particle physics is the study of the elementary constituents of matter and radiation,

and the interactions between them. A major experimental method in particle physics

involves the use of particle accelerators such as the RHIC and LHC to bring beams

of particles to near the speed of light and then collide them together. This focuses a

large amount of energy in a very small region and breaks the particles down into

their components, which are then detected. As particle accelerators have become

larger, the energy densities achieved have become more extreme, prompting some

concern about their safety. These safety concerns have focused on three possibilities:

the formation of ‘true vacuum’, the transformation of the earth into ‘strange matter’,

and the destruction of the earth through the creation of a black hole.

4.1 True vacuum and strange matter formation

The type of vacuum that exists in our universe might not be the lowest possible

vacuum energy state. In this case, the vacuum could decay to the lowest energy state,

either spontaneously, or if triggered by a sufficient disturbance. This would produce

a bubble of ‘true vacuum’ expanding outwards at the speed of light, converting the

universe into different state apparently inhospitable for any kind of life (Turner and

Wilczek 1982).

Our ordinary matter is composed of electrons and two types of quarks: up quarks

and down quarks. Strange matter also contains a third type of quark: the ‘strange’

quark. It has been hypothesized that strange matter might be more stable than

11

normal matter, and able to convert atomic nuclei into more strange matter (Witten

1984). It has also been hypothesized that particle accelerators could produce small

negatively charged clumps of strange matter, known as strangelets. If both these

hypotheses were correct and the strangelet also had a high enough chance of

interacting with normal matter, it would grow inside the Earth, attracting nuclei at

an ever higher rate until the entire planet was converted to strange matter —

destroying all life in the process. Unfortunately strange matter is complex and little

understood, giving models with widely divergent predictions about its stability,

charge and other properties (Jaffe, Busza et al. 2000).

One way of bounding the risk from these sources is the cosmic ray argument: the same

kind of high-energy particle collisions occur all the time in Earth’s atmosphere, on

the surface on the Moon and elsewhere in the universe. The fact that the Moon or

observable stars have not been destroyed despite a vast number of past collisions

(many at much higher energies than can be achieved in human experiments) suggest

that the threat is negligible. This argument was first used against the possibility of

vacuum decay (Hut and Rees 1983) but is quite general.

An influential analysis of the risk from strange matter was carried out in (Dar, De

Rujula et al. 1999) and formed a key part of the safety report for the RHIC. This

analysis took into account the issue that any dangerous remnants from cosmic rays

striking matter at rest would be moving at high relative velocity (and hence much

less likely to interact) while head-on collisions in accelerators could produce

remnants moving much at much slower speeds. They used the rate of collisions of

cosmic rays in free space to estimate strangelet production. These strangelets would

then be slowed by galactic magnetic fields and eventually be absorbed during star

formation. When combined with estimates of the supernova rate, this can be used to

bound the probability of producing a dangerous strangelet in a particle accelerator.

The resulting probability estimate was < 2 ! 10

-9

per year of RHIC operation.

8

While using empirical bounds and experimentally tested physics reduces the

probability of a theory error, the paper needs around 30 steps to reach its conclusion.

For example, even if there was just a 10

-4

chance of a calculation or modelling error

per step this would give a total P(¬A)!"!0.3%. This would easily overshadow the risk

estimate. Indeed, even if just one step had a 10

-4

chance of error, this would

overshadow the estimate.

A subtle complication in the cosmic ray argument was noted in (Tegmark and

Bostrom 2005). The Earth’s survival so far is not sufficient as evidence for safety,

since we do not know if we live in a universe with ‘safe’ natural laws or a universe

where planetary implosions or vacuum decay do occur but we have just been

exceedingly lucky so far. While this latter possibility might sound very unlikely, all

observers in such a universe would find themselves to be in the rare cases where

8

(Kent 2004) points out some mistakes in stating the risk probabilities in different versions of

the paper, as well as for the Brookhaven report. Even if these are purely typesetting mistakes,

it shows that the probability of erroneous risk estimates is nonzero.

12

their planets and stars had survived, and would thus have much the same evidence

as we do. Tegmark and Bostrom had thus found that in ignoring these anthropic

effects, the previous model had been overly narrow. They corrected for this

anthropic bias and, using analysis from (Jaffe, Busza et al. 2000), concluded that the

risk from accelerators was less than 10

-12

per year.

This is an example of a demonstrated flaw in an important physics risk argument

(one that was pivotal in the safety assessment of the RHIC). Moreover, it is

significant that the RHIC had been running for five years on the strength of a flawed

safety report, before Tegmark and Bostrom noticed and fixed this gap in the

argument. Although this flaw was corrected immediately after being found, we

should also note that the correction is dependent on both anthropic reasoning and on

a complex model of the planetary formation rate (Lineweaver, Fenner et al. 2004). If

either of these, or the basic Brookhaven analysis is flawed, the risk estimate is

flawed.

4.2 Black hole formation

The Large Hadron Collider experiment at CERN was designed to explore the

validity and limitations of the Standard Model of particle physics by colliding beams

of high energy protons. This will be the most energetic particle collision experiment

ever done, which has made it the focus of a recent flurry of concerns. Due to the

perceived strength of the previous arguments on vacuum decay and strangelet

production, most of the concern about the LHC has focused on black hole

production.

None of the theory papers we have found appears to have considered the black holes

to be a safety hazard, mainly because they all presuppose that any black holes would

immediately evaporate due to Hawking radiation. However, it was suggested by

(Dimopoulos and Landsberg 2001) that if black holes form, particle accelerators

could be used to test the theory of Hawking radiation. Thus critics also began

questioning whether we could simply assume that black holes would evaporate

harmlessly.

A new risk analysis of LHC black-hole production (Giddings and Mangano 2008)

provides a good example of how risks can be more effectively bounded through

multiple sub-arguments. While never attempting to give a probability of disaster

(rather concluding "there is no risk of any significance whatsoever from such black

holes") it uses a multiple bounds argument. It first shows that rapid black hole decay

is a robust consequence of several different physical theories (A

1

). Second it discusses

the likely incompatibility between non-evaporating black holes and mechanisms for

neutralising black holes: in order for cosmic!ray–produced stable black holes to be

innocuous but accelerator-produced black holes to be dangerous, they have to be

able to shed excess charge rapidly (A

2

). Our current understanding of physics

suggests both that black holes decay and that even if they didn’t, they would be

unable to discharge themselves. Only if this understanding is flawed will the next

section come into play.

13

The third part, which is the bulk of the paper, models how multidimensional and

ordinary black holes would interact with matter. This leads to the conclusion that if

the size scale of multidimensional gravity is smaller than about 20 nm, then the time

required for the black hole to consume the Earth would be larger than the natural

lifetime of the planet. For scenarios where rapid Earth accretion is possible, the

accretion time inside white dwarves and neutron stars would also be very short, yet

production and capture of black holes from impinging cosmic rays would be so high

that the lifespan of the stars would be far shorter than the observed lifespan (and

would contradict white dwarf cooling rates) (A

3

).

While each of these arguments have weaknesses the force of the total argument

(A

1

,A

2

,A

3

) is significantly stronger by the combination of them. Essentially the paper

acts as three sequential arguments, each partly filling in the grey area (see figure 1)

left by the previous. If the theories surrounding black hole decay fail, the argument

about discharge comes into play, and if against all expectation black holes are stable

and neutral the third argument shows that astrophysics constrains them to a low

accretion rate.

4.3 Implications for the safety of the LHC

What are the implications of our analysis for the safety assessment of the LHC? First,

let us consider the stakes in question. If one of the proposed disasters were to occur,

it would mean the destruction of the earth. This would involve the complete

destruction of the environment, 6.5 billion human deaths and the loss of all future

generations. It is worth noting that this loss of all future generations (and with it, all

of humanity’s potential) may well be the greatest of the three, but a comprehensive

assessment of these stakes is outside the scope of this paper. For the present

purposes, it suffices to observe that the destruction of the earth is at least as bad as 6.5

billion human deaths.

There is some controversy as to how one should combine probabilities and stakes

into an overall assessment of a risk. Some hold that the simple approach of expected

utility is the best, while others hold some form of risk aversion. However, we can

sidestep this dispute by noting that in either case, the risk of some harm is at least as

bad as the expected loss. Thus, a risk with probability p of causing a loss at least as

bad as 6.5 billion deaths is at least as bad as a certain 6.5!!!10

9

p deaths.

Now let us turn to the best estimate we can make of the probability of one of the

above disasters occurring during the operation of the LHC. While the arguments for

the safety of the LHC are commendable for their thoroughness, they are not

infallible. Although the report considered several possible physical theories, it is

eminently possible that these are all inadequate representations of the underlying

physical reality. It is also possible that the models of processes in the LHC or the

astronomical processes appealed to in the cosmic ray argument are flawed in an

important way. Finally, it is possible that there is a calculation error in the report.

Recall equation (1):

(1) P(X)!=!P(X|A)!P(A)!+!P(X|¬A)!P(¬A)

14

P(X) is formed from two terms. The second of these represents the additional

probability of disaster due to the argument being unsound. It is the product of the

probability of argument failure and the probability of disaster given such a failure.

Both terms are very difficult to estimate, but we can gain insight by showing the

ranges they would have to lie within, for the risk presented by the LHC to be

acceptable.

From (1), we obtain that:

(4) P(X) # P(X|¬A)!P(¬A) .

If we let l denote the acceptable limit of expected deaths from the operation of the

LHC, we get: 6.5!!!10

9

P(X) $ l. Combining this with equation (4), we obtain:

(5) P(X|¬A)!P(¬A) $ 1.5!!!10

-10

l .

This inequality puts a severe bound on the acceptable values for these probabilities.

Since it is much easier to grasp this with an example, we shall provide some numbers

for the purposes of illustration. Suppose, for example, that the limit were set at 1000

expected deaths, then P(X|¬A)!P(¬A) would have to be below 1.5!!!10

-7

for the risk to

be worth bearing. This requires very low values for these probabilities. We have seen

that for many arguments, P(¬A) is above 10

-3

. We have also seen that the argument

for the safety of the RHIC turned out to have a significant flaw, which was unnoticed

by the experts at the time. It would thus be very bold to suppose that the argument

for the safety of the LHC was much lower than 10

-3

, but for the sake of argument, let

us grant that it is as low as 10

-4

— that out of a sample of 10,000 independent

arguments of similar apparent merit, only one would have any serious error.

Even with the value of P(¬A) were as low as 10

-4

, P(X|¬A)!would have to be below

0.15% for the risk to be worth taking. P(X|¬A) is the probability of disaster given

that the arguments of the safety report are flawed!and is the most difficult

component of equation (1) to estimate. Indeed, few would dispute that we really

have very little idea of what value to put on P(X|¬A). It would thus seem overly

bold to set this below 0.15% without some substantive argument. Perhaps such an

argument could be provided, but until it is, such a low value for P(X|¬A) seems

unwarranted.

We stress that the above combination of numbers was purely for illustrative

purposes, but we cannot find any plausible combination of the three numbers which

meets the bound and which would not require significant argument to explain either

the levels of confidence or the disregard for expected deaths. We would also like to

stress that we are open to the possibility that additional supporting arguments and

independent verification of the models and calculations could significantly reduce

the current chance of a flaw in the argument.

However, our analysis implies that the current safety report should not be the final

word in the safety assessment of the LHC. To proceed with the LHC on the

arguments of the most recent safety report alone, we would require further work on

estimating P(¬A), P(X|¬A), the acceptable expected death toll, and the value of

15

future generations and other life on earth. Such work would require expertise

beyond theoretical physics, and an interdisciplinary group would be essential. If the

stakes were lower, then it might make sense for pragmatic concerns to sweep aside

this extra level of risk analysis, but the stakes are astronomically large, and so further

analysis is critical. Even if the LHC goes ahead without any further analysis, as is

very likely, these lessons must be applied to the assessment of other high-stakes low-

probability risks.

5. Conclusions

When estimating threat probabilities, it is not enough to make conservative estimates

(using the most extreme values or model assumptions compatible with known data).

Rather, we need robust estimates that can handle theory, model and calculation

errors. The need for this becomes considerably more pronounced for low-probability

high-stake events, though we do not say that low probabilities cannot be treated

systematically. Indeed, as pointed out by (Yudkowsky 2008), if we could not

correctly predict probabilities lower than 10

-6

,

!we could not run lotteries.

Some people have raised the concern that our argument might be too powerful: for it

is impossible to disprove the risk of even something as trivial as dropping a pencil,

then our argument might amount to prohibiting everything. It is true that we cannot

completely rule out any probability that apparently inconsequential actions might

have disastrous effects, but there are a number of reasons why we do not need to

worry about universal prohibition. A major reason is that for events like the

dropping of a pencil which have no plausible mechanism for destroying the world, it

seems just as likely that the world would be destroyed by not dropping the pencil.

The expected losses would thus balance out. It is also worth noting that our

argument is simply an appeal to a weak form of decision theory to address an

unusual concern: for our method to lead to incorrect conclusions, it would require a

flaw in decision theory itself, which would be very big news.

It will have occurred to some readers that our argument is fully applicable to this

very paper: there is a chance that we have made an error in our own arguments. We

entirely agree, but note that this possibility does not change our conclusions very

much. Suppose, very pessimistically, that there is a 90% chance that our argument is

sufficiently flawed that the correct approach is to take safety reports’ probability

estimates at face value. Even then, our argument would make a large difference to

how we treat such values. Recall the example from section 2, where a report

concludes a probability of 10

-9

and we revise this to 10

-6

. If there is even a 10% chance

that we are correct in doing so, then the overall probability estimate would be

revised to 0.9!!!10

-9

+ 0.1!!!10

-6

" 10

-7

, which is still a very significant change from the

report’s own estimate. In short, even serious doubt about our methods should not

move one’s probability estimates more than an order of magnitude away from those

our method produces. More modest doubts would have a effect.

The basic message of our paper is that any scientific risk assessment is only able to

give us the probability of a hazard occurring conditioned on the correctness of its

main argument. The need to evaluate! the reliability of the given! argument ! in order

16

to adequately address the risk was shown to be of particular relevance in low-

probability high-stake events. We drew a three-fold distinction!between theory,

model and calculation, and showed how this can be more useful than the common

dichotomy in risk assessment between model and parameter uncertainties. By

providing historic examples for errors in the three fields, we clarified the three-fold

distinction and showed where flaws in a risk assessment might occur. Our analysis

was applied to the recent assessment of risks that might arise from experiments

within particle physics. To conclude this paper, we now provide some very general

remarks on how to avoid argument flaws when assessing risks with high stakes.

Firstly, the testability of predictions can help discern flawed arguments. If a risk

estimate produces a probability distribution for smaller, more common disasters this

can be used to judge whether the observed incidences are compatible with the

theory. Secondly, reproducibility appears to be the most effective way of removing

many of these errors. By having other people replicate the results of calculations

independently our confidence in them can be dramatically increased. By having

other theories and models independently predict the same risk probability our

confidence in them can again be increased, as even if one of the arguments is wrong

the others will remain. Finally, we can reduce the possibility of unconscious bias in

risk assessment through the simple expedient of splitting the assessment into a ‘blue’

team of experts attempting to make an objective risk assessment and a ‘red’ team of

devil’s advocates attempting to demonstrate a risk, followed by repeated turns of

mutual criticism and updates of the models and estimates (Calogero 2000).

Application of such methods could in many cases reduce the probability of error by

several orders of magnitude.

References

Bloch, J. (2006). "Extra, Extra - Read All About It: Nearly All Binary Searches and

Mergesorts are Broken." from

http://googleresearch.blogspot.com/2006/06/extra-extra-read-all-about-it-

nearly.html.

Burchfield, J. D. (1975). Lord Kelvin and the Age of the Earth. New York, Science

History Publications.

Calogero, F. (2000). "Might a laboratory experiment destroy planet earth?"

Interdisciplinary Science Reviews 25(3): 191-202.

Cartwright, N. (1999). Dappled World: A Study of the Boundaries of Science.

Cambridge, Cambridge University Press.

Cokol, M., I. Iossifov, et al. (2007). "How many scientific papers should be retracted?"

Embo Reports 8(5): 422-423.

Dar, A., A. De Rujula, et al. (1999). "Will relativistic heavy-ion colliders destroy our

planet?" Physics Letters B 470(1-4): 142-148.

Dimopoulos, S. and G. Landsberg (2001). "Black holes at the large hadron collider."

Physical Review Letters 8716(16): art. no.-161602.

ESA (1996). ARIANE 5 Flight 501 Failure: Report by the Inquiry Board.

17

García-Berthou, E. and C. Alcaraz (2004). "Incongruence between test statistics and P

values in medical papers." BMC Medical Research Methodology 4(13).

Gartler, S. M. (2006). "The chromosome number in humans: a brief history." Nature

Reviews Genetics 7(8): 655-U1.

Giddings, S. B. and M. M. Mangano. (2008). "Astrophysical implications of

hypothetical stable TeV-scale black holes." arXiv:0806.3381

Giere, R. N. (1999). Science without Laws. Chicago, University of Chicago Press.

Hansson, S. O. (1996). "Decision making under great uncertainty." Philosophy of the

Social Sciences 26: 369-386.

Hempel, C. G. (1950). "Problems and Changes in the Empiricist Criterion of

Meaning." /Rev. Intern. de Philos 11(41): 41-63.

Hillerbrand, R. C. and M. Ghil (2008). "Anthropogenic climate change: Scientific

uncertainties and moral dilemmas." Physica D 237: 2132-2138.

Hut, P. and M. J. Rees (1983). "How Stable Is Our Vacuum." Nature 302(5908): 508-

509.

Jaffe, R. L., W. Busza, et al. (2000). "Review of speculative "disaster scenarios" at

RHIC." Reviews of Modern Physics 72(4): 1125-1140.

Jeng, M. (2006). "A selected history of expectation bias in physics." Am. J. Phys. 74(7):

578-583.

Kent, A. (2004). "A critical look at risk assessments for global catastrophes." Risk

Analysis 24(1): 157-168.

Lineweaver, C. H., Y. Fenner, et al. (2004). "The Galactic habitable zone and the age

distribution of complex life in the Milky Way." Science 303(5654): 59-62.

Miller, G. (2006). "A Scientist’s Nightmare: Software Problem Leads to Five

Retractions." Science 314: 1856-1857.

Morawetz, K. and R. Walke (2003). "Consequences of coarse-grained Vlasov

equations." Physica a-Statistical Mechanics and Its Applications 330(3-4): 469-

495.

Morrison, M. C. (1998). "Modelling nature: Between physics and the physical world."

Philosophia Naturalis 35: 65-85.

NASA (1999). Mars Climate Orbiter Mishap Investigation Board Phase I Report.

Nath, S. B., S. C. Marcus, et al. (2006). "Retractions in the research literature:

misconduct or mistakes?" Medical Journal of Australia 185(3): 152-154.

Nicely, T. R. (2008). "Pentium FDIV Flaw FAQ." from

http://www.trnicely.net/pentbug/pentbug.html.

Nuclear Weapon Archive. (2006). "Operation Castle." from

http://nuclearweaponarchive.org/Usa/Tests/Castle.html.

Panko, R. R. (1998). "What We Know About Spreadsheet Errors." Journal of End User

Computing 10(2): 15-21.

18

Popper, K. (1959). The logic of Scientific Discovery, Harper & Row.

Posner, R. A. (2004). Catastrophe: Risk and Response. Oxford, Oxford University

Press.

Prot, S., J. E. Fontan, et al. (2005). "Drug administration errors and their determinants

in pediatric in-patients." International Journal for Quality in Health Care

17(5): 381-389.

Reichenbach, H. (1938). Experience and prediction. Chicago, University of Chicago

Press.

Stubbs, J., C. Haw, et al. (2006). "Prescription errors in psychiatry - a multi-centre

study." Journal of Psychopharmacology 20(4): 553-561.

Suppes, P. (1957). Introduction to Logic.

Tegmark, M. and N. Bostrom (2005). "Is a doomsday catastrophe likely?" Nature

438(7069): 754-754.

Turner, M. S. and F. Wilczek (1982). "Is Our Vacuum Metastable." Nature 298(5875):

635-636.

Walsh, K. E., C. P. Landrigan, et al. (2008). "Effect of computer order entry on

prevention of serious medication errors in hospitalized children." Pediatrics

121(3): E421-E427.

Witten, E. (1984). "Cosmic Separation of Phases." Physical Review D 30(2): 272-285.

Yudkowsky, E. (2008). Cognitive biases potentially affecting judgement of global

risks. Global Catastrophic Risks. N. Bostrom and M. M. Cirkovic. Oxford,

Oxford University Press.