A Problem in the Bayesian Analysis of Data without Gold Standards
Nick Gray, Marco De Angelis, Dominic Calleja and Scott Ferson
Institute for Risk and Uncertainty, University of Liverpool, United Kingdom.
We review methods of calculating the positive predictive value of a test (the probability of having a condition given a positive test for it) in situations where there is no 'gold standard' way to determine the true classification. We show that Bayesian methods lead to illogical results, whereas a new approach using imprecise probabilities is logically consistent.
Keywords: diagnostics, Bayes’ rule, false positives, prevalence, sensitivity, specificity, uncertainty, gold standard
1. Introduction
In many applications, the problem with classification is the lack of a 'gold standard' of evidence; that is, we cannot decisively assign an observation to a particular category. This phenomenon is pervasive, arising in many fields, from structural health in engineering to supervised learning in computer science and patient diagnosis in medicine. Tests, on which classifications rest, are often imperfect; they yield false alarms (false positives), fail to detect threats (false negatives), or are prone to other misclassifications.
For instance, medical practitioners commonly diagnose a patient's health condition based on some diagnostic which, in isolation, is not definitive, although multiple tests can sometimes be combined to give a definitive result. (Joseph et al. 1995, Albert 2009) The diagnostic result has some statistical uncertainty associated with detecting the true health state, and naively interpreting the result from a medical test can lead to an incorrect assessment of a patient's true health condition. Bayes' rule is commonly used to estimate the actual probability of some individual being a member of a class, subject to some piece of evidence. (Mossman & Berger 2001)
It is often impractical or even impossible to know gold standard information about classification. In medicine there are many conditions for which there is no way to conclusively determine whether a patient is affected. For instance, for patients with Giant Cell Arteritis, even after a biopsy has been undertaken there is still uncertainty about the true health state. (Hunder et al. 1990)
There is also the situation where gold standard information can only be gathered for some classes but not for others. For example, some prison authorities use classification
algorithms to assess whether a prisoner is likely
to reoffend when released from prison on parole.
(Fry 2018) As authorities are unwilling to release
prisoners if the test says that they are likely to
return to crime, it would be impossible to know
whether the recidivism test was accurate or not. If
a prisoner fails the test, and thus remains impris-
oned, there is no available data on whether they
would have re-offended had they been released.
Therefore, we could never know if it was a true
negative or a false negative.
Winkler & Smith (2004) have argued that the traditional application of Bayes' rule in medical counselling is inappropriate and represents a confusion in the medical decision-making literature. They propose in its place a radically different formulation that makes special use of the information about the test results for new patients, although not their actual disease status. As the ground truth cannot be established, they instead construct two new confusion matrices: one based upon the assumption that the test result is a true positive, and the other assuming it is a false positive. They then make use of these alternative facts in order to update the test's sensitivity, specificity and underlying prevalence of the disease, thus reducing the test's uncertainty asymptotically as the test is repeated.
According to Google Scholar, the Winkler and Smith paper has only been referenced 11 times since its publication: Jafar et al. (2007), Finkel (2008), Zuk (2008), Proeve (2009), Weber (2009), Raab (2010), Low-Choy et al. (2011), Cuevas (2015), Cuevas et al. (2016), Rzepiński (2018), Rushdi & Rushdi (2018). The majority of these are papers relating to medical decision making. However, Proeve (2009) concerns child abuse decision making and Low-Choy et al. (2011) concerns plant pest dispersal. Only Cuevas et al. (2016) appears to actually make use of their method. Nevertheless, we think their argument deserves a clear rebuttal because of the centrality of the issue in diagnostic testing, and the remarkable delicacy of Bayesian reasoning by which such a profound disagreement could emerge and escape resolution for many years.

Proceedings of the 29th European Safety and Reliability Conference. Edited by Michael Beer and Enrico Zio. Copyright © 2019 European Safety and Reliability Association. Published by Research Publishing, Singapore. ISBN: 978-981-11-2724-3; doi:10.3850/978-981-11-2724-3 0458-cd
2. The Standard Bayesian Approach to Calculating Positive Predictive Value
Throughout this paper, we will refer to the following hypothetical dataset for TN trials of a diagnostic test for condition D: we have α true positives, β false positives, γ false negatives and δ true negatives. Let T+ be the total number of positive tests, T− the total number of negative tests, TS the total number of sick people and TW the total number of well people. This allows us to construct the confusion matrix shown in Table 1.

Table 1. Sample data.

                Has Problem   No Problem   Total
Positive Test        α             β         T+
Negative Test        γ             δ         T−
Total                TS            TW        TN

The probability that someone has the disease given they have had a positive test result is given by Pr(D|+), whilst the probability that they do not have the disease is Pr(¬D|+). Throughout, we will only consider positive test results; however, all the arguments made could equally apply to negative test results. The prevalence is given by

    p = TS / TN,    (1)

the sensitivity by

    s = α / TS,    (2)

and the specificity by

    t = δ / TW.    (3)

We will also define the ratio of positive tests out of the total number tested as

    T+ / TN.    (4)

Often the values of p, s and t are published independently of each other, in which case the following notation can also be used: p = pk/pn, s = sk/sn and t = tk/tn.

Given the published values of p, s and t, Bayes' rule gives the probability that a patient has D given they have tested positive as:

    Pr(D|+, p, s, t) = ps / (ps + (1 − p)(1 − t)).    (5)

(Baron 1994, Lesaffre & Lawson 2012) We are usually more interested in obtaining the probability that is only conditioned on a positive test outcome, Pr(D|+); this is also known as the positive predictive value (PPV). When p, s and t are available in scalar form, we can obtain Pr(D|+) = Pr(D|+, p, s, t).
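Eq. 5 is straightforward to evaluate. A minimal sketch in Python, using the scalar estimates from the running example (the function name is ours):

```python
def ppv(p, s, t):
    """Positive predictive value via Bayes' rule (Eq. 5):
    Pr(D|+) = ps / (ps + (1 - p)(1 - t))."""
    return p * s / (p * s + (1 - p) * (1 - t))

# Prevalence p = 5/100, sensitivity s = 9/10, specificity t = 9/10.
print(round(ppv(0.05, 0.9, 0.9), 3))  # -> 0.321
```

Note that even with a highly sensitive and specific test, the low prevalence drags the PPV well below the sensitivity.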
The Mossman & Berger (2001) paper takes this a step further and considers the following hypothetical situation:

Mr Smith has tested positive for disorder D, and he asks his doctor the following: "Given the published estimates for prevalence, sensitivity and specificity, what is the 95% confidence interval for my probability of having D given my positive test results and the imprecision in the estimates?"
When there is uncertainty about the values of p, s and t, they can be described by distributions (Smith & Winkler 1999, Smith et al. 2000). There are a couple of ways in which the PPV can then be determined; the simplest is to estimate it using Eq. 5 but replacing p, s and t with their expected values:

    Pr(D|+, p, s, t) = E(p)E(s) / (E(p)E(s) + (1 − E(p))(1 − E(t))).    (6)

In order to obtain a distribution for the PPV, Mossman & Berger (2001) use a convolution of the distributions of p, s and t within Eq. 5. In their numerical calculation they sample random variables from the distributions of p, s and t and use Eq. 5 to find the distribution of the PPV. Both Mossman & Berger (2001) and Winkler & Smith (2004) use the Jeffreys prior for p, s and t:

    fj(x | a, b) = B(x | a + 1/2, b − a + 1/2),    (7)

where B is the beta distribution; however, alternatives are available. (Bolstad 2007)
Using priors defined by pk = 5, pn = 100, sk = 9, sn = 10, tk = 9 and tn = 10, where the values of p, s and t have come from independent trials, the Mossman and Berger method gives the 95% confidence interval for the PPV as
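The Mossman and Berger sampling calculation can be sketched as follows. This is a minimal Monte Carlo version under the Jeffreys priors of Eq. 7; the function names and random seed are ours:

```python
import random

def jeffreys_sample(k, n, rng):
    """Draw from the Jeffreys posterior Beta(k + 1/2, n - k + 1/2) (Eq. 7)."""
    return rng.betavariate(k + 0.5, n - k + 0.5)

def ppv(p, s, t):
    """Positive predictive value, Eq. 5."""
    return p * s / (p * s + (1 - p) * (1 - t))

rng = random.Random(1)
# Running example: pk = 5, pn = 100, sk = 9, sn = 10, tk = 9, tn = 10.
draws = sorted(
    ppv(jeffreys_sample(5, 100, rng),
        jeffreys_sample(9, 10, rng),
        jeffreys_sample(9, 10, rng))
    for _ in range(100_000)
)
lo, hi = draws[2_500], draws[97_499]
print(f"approximate 95% interval for the PPV: [{lo:.3f}, {hi:.3f}]")
```

The empirical 2.5% and 97.5% quantiles of the sampled PPV values give the interval; the Monte Carlo interval is noticeably wider than the spread of the point estimate alone because the specificity is estimated from only ten trials.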
3. The Winkler and Smith Method
Winkler & Smith (2004) diverge from Mossman & Berger (2001) and the textbook method (see Lesaffre & Lawson (2012) as an example): they argue that the outcome of a patient's test should itself be used to update the distribution. They assert that the PPV of a positive medical test for a disease is not Eq. 5, Eq. 6 or Mossman and Berger's (2001) "objective Bayesian method." Rather, it should be computed as a weighted average of assuming the positive result is a true positive (and accordingly augmenting the estimates for prevalence and sensitivity) and assuming the positive result is a false positive (and accordingly decrementing the estimates for prevalence and specificity).
Starting with the same sample data as Table 1, Winkler and Smith suggest that we should construct two new confusion matrices, one assuming a true positive and one assuming a false positive, Table 2 and Table 3 respectively.

Table 2. New confusion matrix assuming true positive.

                Has Problem   No Problem   Total
Positive Test      α + 1          β        T+ + 1
Negative Test        γ            δ          T−
Total              TS + 1         TW       TN + 1

Table 3. New confusion matrix assuming false positive.

                Has Problem   No Problem   Total
Positive Test        α          β + 1      T+ + 1
Negative Test        γ            δ          T−
Total                TS        TW + 1      TN + 1

Assuming a true positive, the prevalence, sensitivity and specificity become:

    pS = (TS + 1) / (TN + 1),    (8)
    sS = (α + 1) / (TS + 1),    (9)
    tS = t.    (10)

Similarly, for the assuming-false-positive case:

    pW = TS / (TN + 1),    (11)
    sW = s,    (12)
    tW = δ / (TW + 1).    (13)

They then say to construct a weighted average of the new confusion matrices using

    Pr(D|+) = f(pS, sS, tS) Pr(D|+, p, s, t) + f(pW, sW, tW) Pr(¬D|+, p, s, t),    (14)

where f(p, s, t) is a convolution of the Jeffreys prior distributions (Eq. 7) for p, s and t:

    f(p, s, t) = fj(p | pk, pn) fj(s | sk, sn) fj(t | tk, tn),    (15)

Pr(D|+, p, s, t) is calculated using Eq. 6 and Pr(¬D|+, p, s, t) = 1 − Pr(D|+, p, s, t). Returning to the numerical example at the end of Section 2 (pk = 5, pn = 100, sk = 9, sn = 10, tk = 9 and tn = 10), we find that the 95% confidence interval is [0.062, 0.741]. Notice that the width of this confidence interval is smaller than that of the Mossman and Berger method.
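The updating in Eqs. 8–13 and the weighted-average idea of Eq. 14 can be sketched with point estimates; Winkler and Smith actually mix full distributions, so this simplified version (variable names ours) only illustrates the mechanics:

```python
pk, pn, sk, sn, tk, tn = 5, 100, 9, 10, 9, 10  # running example counts

def ppv(p, s, t):
    """Positive predictive value, Eq. 5."""
    return p * s / (p * s + (1 - p) * (1 - t))

# Updated point estimates assuming the new result is a true positive
# (Eqs. 8-10) and assuming it is a false positive (Eqs. 11-13):
p_S, s_S, t_S = (pk + 1) / (pn + 1), (sk + 1) / (sn + 1), tk / tn
p_W, s_W, t_W = pk / (pn + 1), sk / sn, tk / (tn + 1)

# Weight the two contingent PPVs by the chance that the positive result
# is a true positive, here taken at the raw point estimates:
w = ppv(pk / pn, sk / sn, tk / tn)
mixture = w * ppv(p_S, s_S, t_S) + (1 - w) * ppv(p_W, s_W, t_W)
print(round(ppv(p_S, s_S, t_S), 3),   # true-positive contingency
      round(ppv(p_W, s_W, t_W), 3),   # false-positive contingency
      round(mixture, 3))              # weighted average
```

The true-positive contingency pulls the PPV up and the false-positive contingency pulls it down, with the mixture lying between them.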
3.1. The Logical Inconsistency with the Winkler and Smith Method
In order to show that the Winkler and Smith method is flawed, we will explore the situation in which more than one test is conducted and show that it leads to a reductio ad absurdum as the number of tests approaches infinity. We will then contrast this with the imprecise probability approach which, whilst not providing useful information, at least makes some sense.

As Winkler and Smith make use of the test result in their calculations, instead of considering the effect of just one test, we can also consider what happens after X positive tests. Using the Winkler and Smith method we would have the following two confusion matrices: Table 4 and Table 5.
Table 4. New confusion matrix assuming true positives for X positive tests.

                Has Problem   No Problem   Total
Positive Test      α + X          β        T+ + X
Negative Test        γ            δ          T−

Table 5. New confusion matrix assuming false positives for X positive tests.

                Has Problem   No Problem   Total
Positive Test        α          β + X      T+ + X
Negative Test        γ            δ          T−

In the assuming-true-positive case the new prevalence would be given by

    pS = (TS + X) / (TN + X),    (16)

the new sensitivity by

    sS = (α + X) / (TS + X),    (17)

and the specificity would not change:

    tS = t.    (18)

Using Table 5 for the assuming-false-positive case, we get the new prevalence as

    pW = TS / (TN + X),    (19)

the sensitivity would not change:

    sW = s,    (20)

and the specificity would be

    tW = δ / (TW + X).    (21)
We will now consider what happens to these updated values of p, s and t when the number of tests becomes large (X → ∞). We find

    pS → 1,    (22)
    sS → 1,    (23)
    tS → t,    (24)
    pW → 0,    (25)
    sW → s,    (26)
    tW → 0.    (27)

Hence, using Equation 5, we get Pr(D|+) = 1, which implies that Pr(¬D|+) = 0. Naively interpreting this result at face value would let you conclude that any positive test at this limit is a true positive, or alternatively that there is no such thing as a false positive. Winkler and Smith do not say to use this result directly; they would instead use Equation 14 along with Equation 15. At the limit this gives us:

    Pr(D|+, p, s, t) = fj(p | TS + X, TN + X) × fj(s | α + X, α + γ + X) × fj(t | δ, β + δ).    (28)

In the limit, the distribution of the PPV collapses to a point mass at 1/2:

    f(x) = 0 for x ≠ 1/2,    (29)

with

    ∫ f(x) dx = 1;    (30)

then the cumulative distribution for the PPV becomes

    F(x) = 0 for x < 1/2, and F(x) = 1 for x ≥ 1/2.    (31)
Figure 1 shows this migration from the first test, X = 1, towards the result in Equation 31, starting from the Section 2 example data set (pk = 5, pn = 100, sk = 9, sn = 10, tk = 9 and tn = 10). These results amount to a logical flaw in the Winkler and Smith method: we have not added any new information (apart from the number of tests), and yet the uncertainty has reduced. It should be noted that this asymptote is due to the choice of prior. Interestingly, and perhaps worryingly, different priors would give different values in the limit X → ∞.
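The asymptote can be seen numerically by tabulating the updated Winkler and Smith estimates for growing X. The cell counts below are hypothetical, chosen only to illustrate the trend:

```python
# Hypothetical single-study confusion matrix counts (illustrative only):
alpha, beta, gamma, delta = 9, 10, 1, 90
TS, TW = alpha + gamma, beta + delta   # sick and well totals
TN = TS + TW                           # total trials

for X in (1, 10, 100, 10_000):
    p_S = (TS + X) / (TN + X)       # updated prevalence, true-positive case
    s_S = (alpha + X) / (TS + X)    # updated sensitivity, true-positive case
    p_W = TS / (TN + X)             # updated prevalence, false-positive case
    t_W = delta / (TW + X)          # updated specificity, false-positive case
    print(X, round(p_S, 4), round(s_S, 4), round(p_W, 4), round(t_W, 4))
# p_S and s_S drift towards 1 while p_W and t_W drift towards 0,
# reproducing the limits in Eqs. 22-27.
```

The drift happens without any new information about the patients beyond the count of positive tests, which is precisely the absurdity.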
Fig. 1. Plot of the CDF for the first 1000 tests using the Winkler and Smith method.
4. Imprecise Probability Approach
It is possible to reconsider the argument made by
Winkler and Smith using a framework provided
by the theory of imprecise probabilities. (Wal-
ley 1991, 1996, Walley et al. 1996) Under this
perspective, the prevalence p, sensitivity s, and
specificity tcan each be updated as prescribed
by Winkler and Smith to yield a distribution for
the PPV assuming the patient is actually sick (in
which case the test was a true positive) and a
distribution for the PPV assuming the patient is
actually not sick (in which case the test was a false
positive). However, the appropriate synthesis of
these two contingent estimates of the PPV is not a
weighted mixture as Winkler and Smith conceive
it. Instead, because whether the patient is sick or
not is precisely what is unknown in this problem,
an envelope of the two distributions would be
more appropriate.
Returning to the numerical example in Section 2, with priors for p, s and t implied by pk = 5, pn = 100, sk = 9, sn = 10, tk = 9 and tn = 10, the envelope of the two contingent distributions yields a rather wide probability box (Ferson et al. 2003), shown as the outer bounds in black in Figure 2. The leftmost edge corresponds to the distribution that assumes the positive test result was a false positive, updating p and t according to Table 3. The rightmost edge corresponds to the distribution that assumes the test result was a true positive, updating p and s according to Table 2. This enveloping calculation is equivalent to a mixture with unknown weights characterised by the vacuous interval [0, 1] for both distributions. The traditional Bayesian result and the Winkler and Smith result are both also shown in the figure; we see that the envelope encloses both the traditional and the Winkler and Smith distributions. The 95% confidence interval using imprecise probabilities is [0.057, 0.848], which as expected encompasses the intervals for both the traditional Bayesian result and the Winkler and Smith result. This envelope is reminiscent of the probabilistic dilation of uncertainty that sometimes accompanies the addition of weakly informative data in probabilistic calculations. (Seidenfeld & Wasserman 1993) In this case, the unverified test result is certainly information, but it does not seem to be information that leads to a contraction of uncertainty about what the test result itself means.
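The enveloping calculation can be sketched by sampling the two contingent PPV distributions and keeping the extreme quantiles at each probability level; the function names and seed are ours:

```python
import random

def ppv(p, s, t):
    """Positive predictive value, Eq. 5."""
    return p * s / (p * s + (1 - p) * (1 - t))

def jeffreys_sample(k, n, rng):
    """Draw from the Jeffreys posterior Beta(k + 1/2, n - k + 1/2) (Eq. 7)."""
    return rng.betavariate(k + 0.5, n - k + 0.5)

rng = random.Random(1)
N = 50_000
pk, pn, sk, sn, tk, tn = 5, 100, 9, 10, 9, 10

# Contingent distribution assuming the result is a true positive
# (counts incremented as in Table 2) ...
tp = sorted(ppv(jeffreys_sample(pk + 1, pn + 1, rng),
                jeffreys_sample(sk + 1, sn + 1, rng),
                jeffreys_sample(tk, tn, rng)) for _ in range(N))
# ... and assuming it is a false positive (counts as in Table 3):
fp = sorted(ppv(jeffreys_sample(pk, pn + 1, rng),
                jeffreys_sample(sk, sn, rng),
                jeffreys_sample(tk, tn + 1, rng)) for _ in range(N))

# Envelope: at each level keep the extreme quantiles of the two
# distributions, i.e. a mixture with vacuous weights [0, 1].
i_lo, i_hi = int(0.025 * N), int(0.975 * N)
lo, hi = min(tp[i_lo], fp[i_lo]), max(tp[i_hi], fp[i_hi])
print(f"approximate 95% envelope for the PPV: [{lo:.3f}, {hi:.3f}]")
```

Because the envelope takes the outer bounds rather than averaging, it is necessarily at least as wide as either contingent interval alone.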
Fig. 2. P-box showing the distribution envelope for the PPV.
4.1. Logical Consistency
As we have shown that the Winkler and Smith method becomes logically inconsistent in the extreme scenario, we will now show that using imprecise probabilities leads to at least a logical result in the limit. The imprecise probability confusion matrix after X positive tests is shown in Table 6, where the notation X[0,1] indicates that each of the X new positive results may be either a true or a false positive.

Table 6. New imprecise probability confusion matrix after X positive tests.

                Has Problem     No Problem     Total
Positive Test    α + X[0,1]     β + X[0,1]    T+ + X
Negative Test        γ              δ           T−
Total            TS + X[0,1]    TW + X[0,1]   TN + X
The new prevalence would be

    p = (TS + X[0,1]) / (TN + X),    (32)

the new sensitivity

    s = (α + X[0,1]) / (TS + X[0,1]),    (33)

and the new specificity

    t = δ / (TW + X[0,1]).    (34)

Now at the X → ∞ limit:

    p = [0, 1],    (35)
    s = 1,    (36)
    t = 0.    (37)

Using these results along with Eq. 5 gives

    Pr(D|+) = [0, 1]    (38)

as the final value for the PPV. Figure 3 shows the migration to the vacuous p-box using the imprecise probability method, starting from the Mossman & Berger (2001) sample data.
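A small sketch illustrates this dilation towards the vacuous interval. For simplicity it holds s and t at their point estimates and propagates only the prevalence interval through Eq. 5; the counts are hypothetical and the paper's full calculation also updates s and t:

```python
alpha, beta, gamma, delta = 9, 10, 1, 90   # hypothetical single-study counts
TS, TW = alpha + gamma, beta + delta
TN = TS + TW

def ppv(p, s, t):
    """Positive predictive value, Eq. 5."""
    return p * s / (p * s + (1 - p) * (1 - t))

s, t = alpha / TS, delta / TW   # point estimates, held fixed for simplicity
for X in (1, 100, 10_000, 1_000_000):
    # Prevalence interval: every one of the X positive results could be
    # a false positive (lower bound) or a true positive (upper bound).
    p_lo, p_hi = TS / (TN + X), (TS + X) / (TN + X)
    # Eq. 5 is increasing in p, so the PPV bounds come from the endpoints.
    print(X, round(ppv(p_lo, s, t), 4), round(ppv(p_hi, s, t), 4))
# The PPV interval widens towards the vacuous [0, 1] of Eq. 38 as X grows.
```

Unlike the Winkler and Smith update, adding unverified positive tests here makes the answer less precise, never more.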
Fig. 3. Plot of the p-boxes for the first 1000 tests using the imprecise probability method.
5. Discussion
Let us first consider the difference between the Winkler and Smith method and the imprecise probability method. Figure 1 shows that as the number of tests increases, the uncertainty in the PPV decreases. This amounts to a reductio ad absurdum, thus proving their method untenable. This uncertainty reduction happens even after one test, as demonstrated in the numerical example in Section 3.

In the imprecise version we have likewise added no new information, but the uncertainty has increased, which we argue is reasonable. Although at the infinity limit the vacuous p-box result is not useful, it at least makes logical sense. It is perfectly reasonable to say 'I don't know' when one does not know.
Equation 15 assumes that the prevalence, sensitivity and specificity are independent of each other. In the original data this is a fair assumption, as the values all come from different independent studies; however, this may not always be the case. For example, when conducting non-invasive prenatal screening for foetal aneuploidy conditions, such as Down's syndrome, the prevalence of these conditions changes with the age of the mother, and as the conditions are rare, studies of the test statistics are often focused on higher-risk categories. (Badeau et al. 2015, Montgomery et al. 2017) Therefore, it is not unimaginable that there is dependence between p, s and t.
6. Conclusion
We have shown that the Winkler and Smith method for dealing with the lack of a gold standard in a classification test is inappropriate: it leads to the illogical result of the test becoming less uncertain after more trials, even though no new information is added. We have also shown that it is possible to reimagine their method using imprecise probabilities in order to produce logically consistent results.
Acknowledgements

This research is partly funded by the UK Engineering & Physical Sciences Research Council (EPSRC) "Digital twins for improved dynamic design", through grant number EP/R006768/1, and the UK Medical Research Council (MRC) "Treatment According to Response in Giant cEll arTeritis (TARGET)", through grant number MR/N011775/1. The funding and support of EPSRC and MRC are gratefully acknowledged.

This paper benefited from discussion with many people, including Masatoshi Sugeno, Jack Siegrist, Michael Balch and Jason O'Rawe.
References

Albert, P. S. (2009), 'Estimating diagnostic accuracy of multiple binary tests with an imperfect reference standard', Statistics in Medicine 28, 780–797.
Badeau, M., Lindsay, C., Blais, J., Takwoingi, Y., Langlois, S., Légaré, F., Giguère, Y., Turgeon, A. F., William, W. & Rousseau, F. (2015), 'Genomics-based non-invasive prenatal testing for detection of fetal chromosomal aneuploidy in pregnant women', Cochrane Database of Systematic Reviews (7).
Baron, J. A. (1994), ‘Uncertainty in Bayes’, Med-
ical Decision Making 14(1), 46–51.
Bolstad, W. M. (2007), Introduction to Bayesian
Statistics, 2 edn, Wiley.
Cuevas, J. R. T., Bravo Melo, L. & Achcar, J. (2016), 'Estimación del Valor Predictivo Positivo de la Colangiopancreatografía Magnética utilizando métodos de Bayes (Estimation of the Positive Predictive Value of the Magnetic Resonance Cholangiopancreatography using Bayes methods) (in Spanish)', Revista Médica de Risaralda 22(1), 19–26.
Cuevas, T. (2015), 'Inferencia Bayesiana e Investigación en salud: un caso de aplicación en diagnóstico clínico (Bayesian Inference and Health Research: a case of application in clinical diagnosis) (in Spanish)', Revista Médica de Risaralda 21(3), 9–16.
Ferson, S., Kreinovich, V., Ginzburg, L., Myers, D. S. & Sentz, K. (2003), Constructing Probability Boxes and Dempster-Shafer Structures, Technical report, Sandia National Laboratories, Albuquerque, United States.
Finkel, A. M. (2008), Protecting People in Spite of–or Thanks to–the Veil of Ignorance, in R. R. Sharp & G. E. Marchant, eds, 'Genomics and Environmental Regulation: Science, Ethics, and Law', The Johns Hopkins University Press, Baltimore, pp. 290–342.
Fry, H. (2018), Hello World: How to be Human
in the Age of the Machine, WW Norton &
Company Inc, New York.
Hunder, G. G., Bloch, D. A., Michel, B. A., Stevens, M. B., Arend, W. P., Calabrese, L. H., Edworthy, S. M., Fauci, A. S., Leavitt, R. Y., Lie, J. T., Lightfoot Jr., R. W., Masi, A. T., McShane, D. J., Mills, J. A., Wallace, S. L. & Zvaifler, N. J. (1990), 'The American College of Rheumatology 1990 criteria for the classification of giant cell arteritis', Arthritis & Rheumatism 33(8), 1122–1128.
Jafar, T. H., Chaturvedi, N., Hatcher, J. & Levey,
A. S. (2007), ‘Use of albumin creatinine ratio
and urine albumin concentration as a screen-
ing test for albuminuria in an Indo-Asian pop-
ulation’, Nephrology Dialysis Transplantation
22(8), 2194–2200.
Joseph, L., Gyorkos, T. W. & Coupal, L. (1995),
‘Bayesian estimation of disease prevalence and
the parameters of diagnostic tests in the absence
of a gold standard’, American Journal of Epi-
demiology 141(3), 263–272.
Lesaffre, E. & Lawson, A. B. (2012), Bayesian Biostatistics, John Wiley & Sons, Ltd, Chichester.
Low-Choy, S., Hammond, N., Penrose, L., An-
derson, C. & Taylor, S. (2011), Dispersal in a
hurry: Bayesian learning from surveillance to
establish area freedom from plant pests with
early dispersal, in ‘MODSIM2011, 19th Inter-
national Congress on Modelling and Simula-
tion’, Perth, Australia, pp. 2521–2527.
Montgomery, J., Caney, S., Clancy, T., Edwards,
J., Gallagher, A., Greenfield, A., Haimes, E.,
Hughes, J., Jackson, R., Lawrence, D., Pat-
tinson, S. D., Shakespear, T., Siddiqui, M.,
Watson, C., Widdows, H., Wishart, A. & de Zu-
lueta, P. (2017), Non-invasive prenatal testing:
ethical issues, Technical report, Nuffield Council on Bioethics.
Mossman, D. & Berger, J. O. (2001), ‘Intervals for
posttest probabilities: A comparison of 5 meth-
ods’, Medical Decision Making 21(6), 498–
Proeve, M. (2009), ‘Issues in the application of
Bayes’ Theorem to child abuse decision mak-
ing’, Child Maltreatment 14(1), 114–120.
Raab, S. (2010), Kidney and Urinary Tract, in
W. Gray & G. Kocjan, eds, ‘Diagnostic Cy-
topathology’, 3 edn, Churchill Livingstone El-
sevier, pp. 365–401.
Rushdi, R. A. & Rushdi, A. M. (2018), 'Karnaugh-Map Utility in Medical Studies: The Case of Fetal Malnutrition', International Journal of Mathematical, Engineering and Management Sciences (IJMEMS) 3(3), 220–244.
Rzepiński, T. (2018), 'Twierdzenie Bayesa w projektowaniu strategii diagnostycznych w medycynie (Making Diagnostic Strategies in Medical Practice with the Use of Bayes' Theorem) (in Polish)', Diametros 57(57), 39–60.
Seidenfeld, T. & Wasserman, L. (1993), ‘Dilation
for Sets of Probabilities’, The Annals of Statis-
tics 21(3), 1139–1154.
Smith, J. E. & Winkler, R. L. (1999), ‘Casey’s
Problem: Interpreting and Evaluating a New
Test’, Interfaces 29(3), 63–76.
Smith, J. E., Winkler, R. L. & Fryback, D. G.
(2000), ‘The First Positive: Computing Positive
Predictive Value at the Extremes’, Annals of
Internal Medicine 132(10), 804.
Walley, P. (1991), Statistical Reasoning with Imprecise Probabilities, Chapman and Hall, London.
Walley, P. (1996), ‘Inferences from Multinomial
Data: Learning about a Bag of Marbles’, Jour-
nal of the Royal Statistical Society. Series B
(Methodological) 58(1), 3–57.
Walley, P., Gurrin, L. & Burton, P. (1996), ‘Anal-
ysis of Clinical Data Using Imprecise Prior
Probabilities’, Journal of the Royal Statistical
Society. Series D (The Statistician) 45(4), 457–
Weber, K. M. (2009), Making a treatment decision
for breast cancer: Associations among mari-
tal qualities, couple communication, and breast
cancer treatment decision outcomes, PhD the-
sis, The Pennsylvania State University.
Winkler, R. L. & Smith, J. E. (2004), ‘On Un-
certainty in Medical Testing’, Medical Decision
Making 24(6), 654–658.
Zuk, T. (2008), Visualizing Uncertainty, PhD the-
sis, The University of Calgary.
ResearchGate has not been able to resolve any citations for this publication.
Full-text available
This paper advocate and demonstrates the utility of the Karnaugh map, as a pictorial manual tool of Boolean algebra, in the exploration of medical problems as exemplified herein by the problem of Fetal Malnutrition (FM). The paper briefly introduces the FM problem, and specifies four metrics or tests used frequently in its study. Clinical data collected about these metrics (as continuous variables or dichotomized versions thereof) are conventionally treated via statistical methods. The Karnaugh map serves as a convenient way for aggregating the set of clinical data available into a pseudo-Boolean function. The map can be used to produce a two-by-two contingency matrix (confusion matrix or frequency matrix) that relates an assessed test or metric to a reference or standard one. Each of these two metrics can be any of the map variables or a function of some or all of these variables. While the map serves in this capacity as a supplement or aid to statistical methods, it is also useful for certain non-statistical methods (specifically Boolean ones). The paper shows how the map entries can be dichotomized via an appropriate threshold for use in Boolean Analysis (BA), which can be conducted despite the lack of a gold standard. The map also implements Qualitative Comparative Analysis (QCA) for the given clinical data. The map variable-handling capability does not pose as a shortcoming for either BA or QCA, since the number of variables involved (not only herein but in other typical medical problems as well) is relatively small. The concepts and methods introduced herein are demonstrated through application to the aforementioned set of clinical data for the FM problem, and can be extended to a wide variety of medical problems.
Full-text available
Background Common fetal aneuploidies include Down syndrome (trisomy 21 or T21), Edward syndrome (trisomy 18 or T18), Patau syndrome (trisomy 13 or T13), Turner syndrome (45,X), Klinefelter syndrome (47,XXY), Triple X syndrome (47,XXX) and 47,XYY syndrome (47,XYY). Prenatal screening for fetal aneuploidies is standard care in many countries, but current biochemical and ultrasound tests have high false negative and false positive rates. The discovery of fetal circulating cell-free DNA (ccfDNA) in maternal blood offers the potential for genomics-based non-invasive prenatal testing (gNIPT) as a more accurate screening method. Two approaches used for gNIPT are massively parallel shotgun sequencing (MPSS) and targeted massively parallel sequencing (TMPS). Objectives To evaluate and compare the diagnostic accuracy of MPSS and TMPS for gNIPT as a first-tier test in unselected populations of pregnant women undergoing aneuploidy screening or as a second-tier test in pregnant women considered to be high risk after first-tier screening for common fetal aneuploidies. The gNIPT results were confirmed by a reference standard such as fetal karyotype or neonatal clinical examination. Search methods We searched 13 databases (including MEDLINE, Embase and Web of Science) from 1 January 2007 to 12 July 2016 without any language, search filter or publication type restrictions. We also screened reference lists of relevant full-text articles, websites of private prenatal diagnosis companies and conference abstracts. Selection criteria Studies could include pregnant women of any age, ethnicity and gestational age with singleton or multifetal pregnancy. The women must have had a screening test for fetal aneuploidy by MPSS or TMPS and a reference standard such as fetal karyotype or medical records from birth. Data collection and analysis Two review authors independently carried out study selection, data extraction and quality assessment (using the QUADAS-2 tool). 
Where possible, hierarchical models or simpler alternatives were used for meta-analysis. Main results Sixty-five studies of 86,139 pregnant women (3141 aneuploids and 82,998 euploids) were included. No study was judged to be at low risk of bias across the four domains of the QUADAS-2 tool but applicability concerns were generally low. Of the 65 studies, 42 enrolled pregnant women at high risk, five recruited an unselected population and 18 recruited cohorts with a mix of prior risk of fetal aneuploidy. Among the 65 studies, 44 evaluated MPSS and 21 evaluated TMPS; of these, five studies also compared gNIPT with a traditional screening test (biochemical, ultrasound or both). Forty-six out of 65 studies (71%) reported gNIPT assay failure rate, which ranged between 0% and 25% for MPSS, and between 0.8% and 7.5% for TMPS. In the population of unselected pregnant women, MPSS was evaluated by only one study; the study assessed T21, T18 and T13. TMPS was assessed for T21 in four studies involving unselected cohorts; three of the studies also assessed T18 and 13. In pooled analyses (88 T21 cases, 22 T18 cases, eight T13 cases and 20,649 unaffected pregnancies (non T21, T18 and T13)), the clinical sensitivity (95% confidence interval (CI)) of TMPS was 99.2% (78.2% to 100%), 90.9% (70.0% to 97.7%) and 65.1% (9.16% to 97.2%) for T21, T18 and T13, respectively. The corresponding clinical specificity was above 99.9% for T21, T18 and T13. In high-risk populations, MPSS was assessed for T21, T18, T13 and 45,X in 30, 28, 20 and 12 studies, respectively. In pooled analyses (1048 T21 cases, 332 T18 cases, 128 T13 cases and 15,797 unaffected pregnancies), the clinical sensitivity (95% confidence interval (CI)) of MPSS was 99.7% (98.0% to 100%), 97.8% (92.5% to 99.4%), 95.8% (86.1% to 98.9%) and 91.7% (78.3% to 97.1%) for T21, T18, T13 and 45,X, respectively. 
The corresponding clinical specificities (95% CI) were 99.9% (99.8% to 100%), 99.9% (99.8% to 100%), 99.8% (99.8% to 99.9%) and 99.6% (98.9% to 99.8%). In this risk group, TMPS was assessed for T21, T18, T13 and 45,X in six, five, two and four studies. In pooled analyses (246 T21 cases, 112 T18 cases, 20 T13 cases and 4282 unaffected pregnancies), the clinical sensitivity (95% CI) of TMPS was 99.2% (96.8% to 99.8%), 98.2% (93.1% to 99.6%), 100% (83.9% to 100%) and 92.4% (84.1% to 96.5%) for T21, T18, T13 and 45,X respectively. The clinical specificities were above 100% for T21, T18 and T13 and 99.8% (98.3% to 100%) for 45,X. Indirect comparisons of MPSS and TMPS for T21, T18 and 45,X showed no statistical difference in clinical sensitivity, clinical specificity or both. Due to limited data, comparative meta-analysis of MPSS and TMPS was not possible for T13. We were unable to perform meta-analyses of gNIPT for 47,XXX, 47,XXY and 47,XYY because there were very few or no studies in one or more risk groups. Authors' conclusions These results show that MPSS and TMPS perform similarly in terms of clinical sensitivity and specificity for the detection of fetal T31, T18, T13 and sex chromosome aneuploidy (SCA). However, no study compared the two approaches head-to-head in the same cohort of patients. The accuracy of gNIPT as a prenatal screening test has been mainly evaluated as a second-tier screening test to identify pregnancies at very low risk of fetal aneuploidies (T21, T18 and T13), thus avoiding invasive procedures. Genomics-based non-invasive prenatal testing methods appear to be sensitive and highly specific for detection of fetal trisomies 21, 18 and 13 in high-risk populations. There is paucity of data on the accuracy of gNIPT as a first-tier aneuploidy screening test in a population of unselected pregnant women. 
With respect to the replacement of invasive tests, the performance of gNIPT observed in this review is not sufficient to replace current invasive diagnostic tests. We conclude that given the current data on the performance of gNIPT, invasive fetal karyotyping is still the required diagnostic approach to confirm the presence of a chromosomal abnormality prior to making irreversible decisions relative to the pregnancy outcome. However, most of the gNIPT studies were prone to bias, especially in terms of the selection of participants.
Criteria for the classification of Churg-Strauss syndrome (CSS) were developed by comparing 20 patients who had this diagnosis with 787 control patients with other forms of vasculitis. For the traditional format classification, 6 criteria were selected: asthma, eosinophilia >10% on differential white blood cell count, mononeuropathy (including multiplex) or polyneuropathy, non-fixed pulmonary infiltrates on roentgenography, paranasal sinus abnormality, and biopsy containing a blood vessel with extravascular eosinophils. The presence of 4 or more of these 6 criteria yielded a sensitivity of 85% and a specificity of 99.7%. A classification tree was also constructed with 3 selected criteria: asthma, eosinophilia >10% on differential white blood cell count, and history of documented allergy other than asthma or drug sensitivity. If a subject has eosinophilia and a documented history of either asthma or allergy, then that subject is classified as having CSS. For the tree classification, the sensitivity was 95% and the specificity was 99.2%. Advantages of the traditional format compared with the classification tree format, when applied to patients with systemic vasculitis, and their comparison with earlier work on CSS are discussed.
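The two classification rules reported in this abstract (the traditional 4-of-6-criteria format and the three-criterion tree) can be sketched as simple predicate functions. The criterion names and thresholds come from the abstract; the function and parameter names are illustrative, not from the original paper.

```python
def css_traditional(criteria_met: int) -> bool:
    """Traditional format: classify as CSS when at least 4 of the 6
    criteria are present (asthma; eosinophilia >10%; mono- or
    polyneuropathy; non-fixed pulmonary infiltrates; paranasal sinus
    abnormality; extravascular eosinophils on biopsy)."""
    return criteria_met >= 4


def css_tree(eosinophilia: bool, asthma: bool, allergy_history: bool) -> bool:
    """Classification tree: eosinophilia >10% together with a
    documented history of either asthma or allergy classifies the
    subject as having CSS."""
    return eosinophilia and (asthma or allergy_history)
```

For example, a patient with eosinophilia and documented allergy but no asthma is classified as CSS by the tree rule.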
This paper describes a new method, based on the theory of imprecise probabilities, for analysing clinical data in the form of a contingency table. The method is applied to a well-known set of statistical data from randomized clinical trials of two treatments for severe cardiorespiratory failure in newborn babies. Two problems are distinguished. The inference problem is to draw conclusions about which treatment is more effective. The decision problem is to determine whether one treatment should be preferred to another for the next patient, or whether it is ethical to select the treatment by randomization. The two problems are analysed using three possible models for prior ignorance about the statistical parameters, and one of the models is modified to take account of earlier clinical data. In this example the four models produce essentially the same conclusions.
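The kind of imprecise prior model described here can be illustrated with the standard imprecise beta construction for a binomial success probability, where a whole set of beta priors is updated and the posterior expectation is reported as an interval. This is a generic sketch of that construction, not necessarily the exact models compared in the paper, and the function name is our own.

```python
def imprecise_beta_bounds(successes: int, trials: int, s: float = 2.0):
    """Lower and upper posterior expectations of a success probability
    under the set of priors beta(s*t, s*(1-t)) for all t in (0, 1),
    where s is the prior strength (learning parameter).

    As the priors range over all t, the posterior mean
    (successes + s*t) / (trials + s) sweeps out the interval below."""
    lower = successes / (trials + s)
    upper = (successes + s) / (trials + s)
    return lower, upper
```

With no data (0 successes in 0 trials) the interval is the vacuous (0, 1); as trials accumulate the interval narrows, reflecting prior ignorance giving way to evidence.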
The Scientific Method: A Process for Learning; The Role of Statistics in the Scientific Method; Main Approaches to Statistics; Purpose and Organization of This Text.
Casey, the newborn daughter of one of the authors of this paper, received a positive result on an experimental medical screening test, indicating that she may lack an enzyme required to digest certain fats. The interpretation of this test result was complicated by uncertainty about the false-positive rate for the test (this was the first positive reading) and the prevalence of the medical condition. We used a simple Bayesian model to help assess the probability that Casey actually had the enzyme deficiency and to better understand the role and value of this screening test. The model we used and, more generally, our style of analysis could also be used with other new diagnostic tests, such as tests used in manufacturing and environmental contexts as well as in other medical situations.
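The Bayes' rule calculation at the heart of this kind of analysis can be sketched as follows. The numerical values are placeholders chosen for illustration, not data from the paper; the point is that uncertainty about the false-positive rate propagates directly into the posterior probability.

```python
def positive_predictive_value(prevalence: float,
                              sensitivity: float,
                              false_positive_rate: float) -> float:
    """Bayes' rule: P(condition | positive test).

    PPV = sens * prev / (sens * prev + fpr * (1 - prev))."""
    true_pos = sensitivity * prevalence
    false_pos = false_positive_rate * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)


# Illustrative only: a rare condition (prevalence 1 in 10,000) and a
# poorly known false-positive rate. Each candidate rate gives a very
# different posterior probability of actually having the condition.
for fpr in (0.001, 0.01, 0.05):
    ppv = positive_predictive_value(prevalence=1e-4,
                                    sensitivity=0.99,
                                    false_positive_rate=fpr)
    print(f"false-positive rate {fpr:.3f} -> PPV {ppv:.4f}")
```

Even a seemingly small change in the assumed false-positive rate (0.001 versus 0.05) changes the posterior by more than an order of magnitude, which is why uncertainty about that rate dominated the interpretation of Casey's result.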