A giant with feet of clay: on the validity of the
data that feed machine learning in medicine?
Federico Cabitza1,2, Davide Ciucci1, and Raffaele Rasoini3
1Universitá degli Studi di Milano-Bicocca, Milan, Italy
cabitza,ciucci@disco.unimib.it
2IRCCS Istituto Ortopedico Galeazzi, Milan, Italy
3IFCA Istituto Fiorentino di Cura e Assistenza, Florence, Italy
raffaele.rasoini@tiscali.it
Abstract. This paper considers the use of Machine Learning (ML) in
medicine by focusing on the main problem that this computational ap-
proach has been aimed at solving or at least minimizing: uncertainty.
To this aim, we point out how uncertainty is so ingrained in medicine
that it biases also the representation of clinical phenomena, that is the
very input of ML models, thus undermining the clinical significance of
their output. Recognizing this can motivate both medical doctors, in
taking more responsibility in the development and use of these decision
aids, and the researchers, in pursuing different ways to assess the value
of these systems. In so doing, both designers and users could take this
intrinsic characteristic of medicine more seriously and consider alterna-
tive approaches that do not “sweep uncertainty under the rug” within an
objectivist fiction, which everyone can end up believing to be true.
Keywords: Decision Support Systems, Machine Learning, Information
Bias, Algorithmic Bias
1 Motivations and Background
It is a truism to say that uncertainty permeates contemporary medicine – not
much differently than it has always done – as it has also been confirmed by ex-
tensive studies in the field of sociology and medicine itself (e.g., [28,74,79,41]).
However, uncertainty is a broad term, which encompasses many types of short-
comings of human knowledge. We cope with some form of uncertainty when we
cannot pinpoint a phenomenon exactly or when we cannot measure it precisely
(i.e., approximation, inaccuracy); when we do not possess a complete account of
? If you would like to cite this article in your work, please consider citing the
version that we published in a book by Springer. It can be referenced as follows:
Cabitza, F., Ciucci, D., Rasoini, R.: A giant with feet of clay: on the validity of the
data that feed machine learning in medicine. In: Cabitza, F., Magni, M., Batini, C.
(eds.) Organizing for the Digital World, Lecture Notes in Information Systems and
Organisation, vol. 28, chap. 2, pp. 113–128. Springer (2018). ISBN: 978-3-319-90503-7
a case (incompleteness, inadequacy); when we cannot predict what will come
next (unpredictability due to randomness or excessive complexity); when our ob-
servations seem to contradict each other (inconsistency, ambiguity); and, more
generally, when we are not confident of what we know. In clinical practice, all
of these phenomena occur several times a day. Greenhalgh [34] has
recently proposed a taxonomy of uncertainty in medicine: she distinguishes be-
tween uncertainty about the best available evidence (will it be sound? will it be
generalizable?); about the story of patients (are they sincere? are they reliable?);
about case-specific decisions (e.g., what best to do in the circumstances?); and
about social or soft skills (e.g., how best to communicate and collaborate with
colleagues?).
More generally, medical doctors can be uncertain about almost every aspect of
their practice: on how to classify patients’ conditions (diagnostic uncertainty);
why and how patients develop diseases (pathophysiological u.); what treatment
will be more appropriate for them (therapeutic u.); whether they will recover
with or without a specific treatment (prognostic u.), and so on. In this picture,
technology has often been proposed – and seen – as a solution. In the words by
Reiser [73, p. 18]: From the beginning of their introduction in the mid-nineteenth
century, automated machines that generated results in objective formats [...] were
thought capable of purging from health care the distortions of subjective human
opinion [and] to produce facts free of personal bias, and thus to reduce the un-
certainty associated with human choice.
Also computing technology has been proposed to address all of the above
areas of uncertainty – to either try to control or minimize it: the first computa-
tional support, what was then called a rule-based expert system, was introduced
more than 40 years ago to propose a “quantification scheme which attempts to
model the inexact reasoning processes of medical experts” [78].
After the introduction of many and different computational systems [76], a
new class of applications has recently emerged in the health care debate: the
decision support systems (DSS) that embed predictive models that have been
developed by means of machine learning (ML) methods and techniques. These
systems, which for the sake of brevity we will call ML-DSS, have recently raised
strong interest among medical practitioners in almost every corner of the
world by virtue of their unprecedentedly high accuracy [26,37], in some
cases even allegedly capable of outperforming human experts (e.g., [50,93]). This
is reflected by the stance of influential commentators and medical experts that
have recently shared their thoughts in some of the highest-impact journals of
the medical community (e.g. [21,63,19,48,53,30], just to cite a few). These voices
clearly do not indulge in techno-optimistic claims and do not refrain from offer-
ing some words of caution; however, the recent successes of ML-DSS in medical
imaging (e.g., radiography, computed tomography, ultrasonography, magnetic
resonance imaging, retinal fundus photography, skin images) pose the issue of
how these systems and their improved versions, which will likely outperform hu-
man diagnosticians, will impact some medical professions like radiologists and
pathologists [63,16,48], and health care in general [21,19]. In regard to this im-
pact, two elements should be the object of further scholarly interest and research,
which are bound together by a feedback loop that makes their mutual influence
subtle and hard to pinpoint. First: how ML-DSS can bias human interpretation
and decision, or automation bias. Second: how human interpretation and classifi-
cation can bias the ML-DSS performance, or information bias. While the former
case is still largely neglected, some first studies are shedding light on it [65,31,9]; in
this paper we will focus on the latter case, which is almost completely ignored,
especially by the computer scientists and designers of ML models. Nevertheless,
information bias, which we will define in the next section, regards the quality4
of both the training and input data of ML-DSS, and hence has the poten-
tial to undermine the reliability of the output of ML-DSS. Our point is that a
renewed awareness of the irreducible nature of the uncertainty of medical phe-
nomena, even in regard to their plain representation in medical data, can help
put the current potential of ML-DSS in the right perspective, and motivate the
exploration of alternative ways to conceive them and validate their indications,
as we will outline in the last sections of this contribution.
2 Seeing what we are trained to see
Before considering information bias from a medical perspective, let us recall what
a ML predictive model is. A predictive model, not excepting those developed
with a machine learning approach5, is, functionally, a relational model that binds
input data to one category out of a set of predefined ones (most of the time
encompassing only two options, like positive/negative). This latter category is
the output (or prediction) of the model.
To this aim, the model is progressively fine-tuned on the basis of what ML
experts call experience [61, p. 2], that is input data that have been already
classified in terms of a specific category. In the case of medical classification (for
either diagnostic or prognostic aims) the above “experience” is a set of cases that
have been already represented properly and classified “correctly” according to
some gold standard method. In so doing, the machine can learn the model, that
is the hidden structural relationship between the features of the cases (as long
as these are all represented in terms of the same attributes and characteristics),
and hence the correct interpretation of each case. On the grounds of this model, the
ML-DSS can “predict” the right category or label when fed with new cases, as
long as these are sufficiently similar to those with which it has been trained.
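To make the notion of “experience” concrete, the following minimal sketch (in Python, using the scikit-learn library) reproduces the supervised-learning loop just described on purely synthetic data; the number of cases, the attributes and the choice of a decision tree are illustrative assumptions of ours, not a recommendation.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

# The "experience": cases already represented by the same attributes and
# already classified according to some gold standard (0 = negative, 1 = positive).
X = rng.normal(size=(200, 4))                   # 200 cases, 4 clinical attributes (synthetic)
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)   # gold-standard labels (synthetic)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Choosing a model family (here, a shallow decision tree) is itself a representational bias.
model = DecisionTreeClassifier(max_depth=3)
model.fit(X_train, y_train)                     # the machine "learns the model" from the experience

# The fitted model "predicts" the category of new cases similar to the training ones.
print("accuracy on held-out cases:", accuracy_score(y_test, model.predict(X_test)))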
As widely known, ML encompasses many families of models and, within each
family, every model can differ from the others in the countless combinations of
its various internal parameters. The reason why different models perform differ-
ently on the same data is that they represent them differently and make different
assumptions on their intrinsic regularities: in short, they are all differently bi-
ased. This term is often used in the ML community, not always with precision
or consistency, to account for the many factors that condition a model in pre-
dicting the output. The preference for simpler models, or Occam’s razor, is
an example of a bias active in model optimization and selection.
4This is a vague term: here we mean data quality mainly in terms of accuracy and
validity.
5In what follows we introduce the concept of ML predictive model with reference
to supervised discriminative (or classification) models. Although machine learning
models can be of various kinds, the above models are by far the most frequently
used in medicine. What we say for this particular kind of ML is easily applicable to
other forms of ML with just minor changes in wording.
Once again, Mitchell deserves credit for having given a simple yet clear definition
of bias in the ML discourse: this is any basis for making a specific prediction over
another, other than strict consistency with the observed training instances [60].
The most general kind of bias is inductive bias (also called learning bias6), whose
main components are representational bias (also called language bias) and algo-
rithmic bias (also called procedural bias) [32]. The latter bias regards the set
of assumptions and heuristics by which a model is refined, that is by which the
optimization procedure traverses the space of all possible hypothetical models
and finds the one fittest to the available experience. Representational bias is some-
what broader and covers different aspects that are nevertheless connected, regarding
what structure the model family assumes will adequately capture regularities in
the data (e.g., a decision tree), what constitutes a complete representation of
the above space of all possible fitting models and, last but not least, any
assumption in representing entities, states and events in the reality of interest,
that is any modeling choice that translates into considering certain features
and attributes and not others, with their types and their ranges of values. Figure 1
illustrates (in spatial terms) the extent of any representation of the reality of
interest.
Thus we come to the main issue at stake in this paper: how valid
and reliable is the representation of the above “experience” on which ML-DSS
learn their predictive model? Two main biases weigh on this question: the above
representational bias and the so-called information bias [1]. By information bias
we mean a collective name for all the human biases that distort the data on
which a decision maker (or a computational decision support) relies, and that
affect the validity of data, that is the extent to which they accurately represent
what they are supposed to represent. These two biases, and the related phenomenon
of information variability, all concur in undermining the extent to which we can be
certain of the available data, and hence of the predictions ML models can infer from
them.
The relationship between representational bias and information bias is mu-
tually conditioning and so tight that distinguishing between them could be im-
proper at a high level of description7, as both regard the shortcomings that are
6The name comes from the main assumption of any learning process, which was first
acknowledged by Hume in the 1700s: “we suppose, but are never able to prove, that
there must be a resemblance betwixt those objects, of which we have had experience,
and those which lie beyond the reach of our discovery” (from A Treatise of Human
Nature, Book I, Part III, Section VI, 1740).
7If we could choose the terminology that we deem more accurate, we would propose
to denote as representational bias the general phenomenon, which encompasses both
a modeling bias (regarding the conceptual model of the data) and an information
bias (regarding the description of the reality in terms of that conceptual model).
However, this choice would likely produce more confusion in the specialist literature,
which is already cluttered with different forms of biases and their effects.
Fig. 1. An illustration of the fact that we represent only a small portion of reality. The
three axes represent the cases (C), the dimensions (D) and time (T). The experience
is represented in tabular form in the Experience field (the C-D plane). Dimensions
(or columns in the tabular form) can be added, deleted or renamed over time (plane
of modeling, D-T). Cases (or rows in tabular form) can be added, deleted or, more
frequently, modified (i.e., the output and effect of any data-oriented practice).
inherent to any classifying taxonomy and measurement scales8. To borrow terms
from data-base conceptual modeling, we could see representational bias as af-
fecting the creation of the schema of our experience representation: the choice of
what attributes and value ranges designers deem relevant to adopt, the standard
classifications that the designers considered adequate to pinpoint the aspects of
the reality of interest: in short, any bias in drawing an empty table. On the other
hand, information bias affects the situated activities of filling in this table, the
“instancing” of the above schema, and the “projection” of the observed phenomena
into the available data structure. In some ways information bias can be seen as
the “grounding” of representational bias, or the user side of it.
Since ML specialists tend to overlook this side, in the next section we will
characterize this concept further, and discuss its role with respect to uncertainty
in medical decision making.
8These shortcomings would deserve a study of their own. In a famous work, Star and
Bowker [6, p. 69] hinted at some of these inadequacies, which include: temporal
rigidity; a one-size-fits-all nature in regard to meaning and implications; and, as also
discussed in [84], the reflection of disciplinary interests, agendas and priorities.
3 Information bias, the open secret of medical records
Information bias9 accounts for any factor that could undermine the validity of
a representation, that is the extent to which it truly represents one or more aspects of
the reality of interest: in short, anything that undermines the validity of the available
data. This general bias can take various forms including, most manuals concur,
measurement error, misclassification and miscoding. However, this bias should
not only be associated with errors and mistakes by the “data producers” (in
our case medical doctors, nurses and patients) due to either negligence, fraud,
incompetence, incomprehension or inexperience.
In medicine, information bias can be due to both patients and care givers
in different but intertwined ways (see Figure 2). Patients can (often unawares)
contribute in terms of response bias. This bias occurs when patients either exag-
gerate or understate their conditions (for many reasons and often in good faith),
or whenever they intentionally suppress some information (e.g., like in case of
sexually transmitted diseases or drug history for the related social stigma) or
distort it (e.g., to avoid legal consequences); or when their recall is limited or
flawed, or simply because they do not understand what physicians ask them
or they aim to respond how (they believe) physicians expect them to (cf. the
particular kind of response bias known as “social desirability bias”). A large
part of response bias can be related to the inability of doctors to gain the
confidence of the respondent [46]. As an example, a recent study [77] focused on the
degree of agreement between patients and cardiologists in regard to
the presence of angina pectoris and its frequency. The study showed that when
patients reported monthly angina symptoms, cardiologists’ diagnoses were com-
patible with patient-reported symptoms only 17% of the time, while among patients
who reported more frequent (i.e., daily/weekly) angina symptoms, approximately
one quarter were noted by cardiologists as having no angina.
Besides the above mentioned condescending bias, it is well known that pa-
tients can exhibit behaviors (cf. Hawthorne effect), or even detectable modifica-
tions of some physiological parameter (e.g., blood pressure in what is also known
as ‘white-coat effect’) when they are under examination in a clinical setting that
they would not exhibit in other settings.
Response bias is tightly associated with the main source of bias from the
physicians’ side: observer bias. This occurs whenever the observer affects a mea-
sure, its accuracy and completeness, which clearly should not depend on the act
of observing and measuring itself. Physicians introduce observer bias in their
data due to perceptual, cognitive and behavioral traits, weaknesses or just
“bad habits”; this bias also occurs whenever they, “unwittingly (or even intention-
ally) exercise more care about one type of response or measurement than others,
for instance, those supporting a particular hypothesis versus those opposing the
hypothesis” [46] (cf. confirmation bias).
9While biases are, strictly speaking, mental prejudices, idiosyncratic perceptions and
cognitive behaviors producing an either impairing and distorting effect, here we
rather intend the effect (by metonymy), that is the “error” in the data recorded and
the decisions taken caused by the bias itself.
Fig. 2. The main biases affecting the validity and reliability of medical data. Sampling
and Nonresponse biases are indicated to account for the lack of information that, if
present, could reduce uncertainty of representation.
Those who observe a clinical condition are often those who report it in the
medical record. In this case observer bias can blur with what is denoted as
either recording, reporting, or coding bias. In particular, this latter distorting
factor can be traced back to many causes, from the least common and most
poorly studied, like digit preference and conflictual coding, to the most perva-
sive ones, like the intrinsic inadequacies of any classification schema. Conflictual
coding can affect the accuracy and completeness of medical data when proper
reporting clashes with the personal interests of those who are supposed to doc-
ument a clinical condition (like in the case of blood pressure recordings within a
quality and outcome assessment framework of incentives [14]). Digit preference
occurs when measurements are more frequently recorded ending with 0 or 5, or
as results of arbitrary rounding off. In [42] the authors observed a much larger
occurrence of these two digits in renal cell carcinoma measurement (p<0.0001)
and concluded that this recording behavior could affect the determination of
tumour stage, “with resulting consequences in regard to prognosis and patient
management”. Moreover, coding variability that leads to a lack of reliability can
happen even when instructions on how to code properly are well known: a study [2]
compared consistency of coding supposedly clean and high-quality data-sources
like clinical research forms from observational studies among three professional
coders, each using the same terminology and with the same instructions. All
three coders agreed on the same core concept 33% of the time; two of the three
coders selected the same core concept 44% of the time; and, no agreement among
all three was found 23% of the time. Moreover, no significant level of agreement
beyond that due to chance was found among the experts. The same conclusion
was drawn in a study where three shoulder specialists tried to evaluate 50 pa-
tients with shoulder instability and classified their condition using one of the 16
ICD-9 codes for shoulder instability [83].
Fig. 3. Differences between validity and reliability. Each point is a multi-dimensional
representation of a case (or observation). A depicts the case of likely reliable data
(though nothing can be said of their validity, lacking a reference truth); B depicts the
case of data less reliable than A (e.g., because variance is larger, raters could disagree
with each other). C depicts the same data as B, but the ground truth allows one to
evaluate bias (i.e., the offset with respect to the bullseye) and hence validity. D depicts
the same data as in B and C, but these data now turn out to be more valid because
a different ground truth is taken as reference. Notably in medicine, differently from
many engineering disciplines, also the ground truth can “move” according to the gold
standard taken as reference (see Section 4).
Last, but not least, information bias in medicine can also be traced back to
some sort of intrinsic ambiguity of the medical conditions being documented, due
to either their instability over time, or to variability across subjects and across
observers. A noteworthy example of this sort of ambiguity can be found in a
recent study by Dharmarajan and colleagues [22]. This study focused on elderly
people diagnosed (at hospital admission) with one of the following conditions:
pneumonia, chronic obstructive pulmonary disease, or heart failure. These are
three common conditions of the elderly that are responsible for breathlessness
and other warning symptoms usually requiring hospital admission. The authors
showed that patients regularly received, during hospital stay, concurrent treat-
ment for two or more of the above cardiopulmonary conditions and not only for
the main diagnosis identified at hospital admission. This exemplifies the fact that
in real-world clinical practice, patients’ clinical pictures are often blurry and can-
not be associated with the clear-cut labels found in text-
books and clinical practice guidelines. Indeed, even common clinical syndromes
have disease presentations that often fall in-between traditional diagnostic cat-
egories. The common and relevant overlap of medical treatments as in the case
mentioned above highlights the intrinsic ambiguity of clinical phenomena and
the downstream uncertainty that medical doctors face in choosing what they
deem a single right therapeutic strategy for a specific disorder.
Unfortunately, all of the kinds of variability and biases mentioned above
cannot be conclusively addressed by improving the accuracy of any measurement
tool, or by any other contrivance conceived from the engineering standpoint.
Moreover, the extent to which these biases are expressed in a clinical setting varies a lot:
although they look like abstract and general categories, biases are always exhibited
by someone in particular, they are highly situated, and they depend on personal
skills, like clinical perspicacity, life-long acquired competencies, and contingent
workloads. Since the impact of personal biases is difficult (if not impossible) to
prevent, medical organizations try to minimize them with redundancy of effort,
like relying on double checking and on second (or multiple) opinion. However,
multiple opinions are both a resource to fight biases (by averaging multiple
observations and measures), and, paradoxically enough, a source of low reliability
and further uncertainty (‘quot capita, tot sententiae’).
Indeed, when more than one physician is supposed to determine the pres-
ence of a sign, make a diagnosis, or assess the severity of a condition, observer
variability (see Figure 2) is introduced to account for the discrepancies in their
opinions and for any difference in considering the same conditions. Observer
variability has to do with the reliability of the judgment of so-called medical
“raters”, and with the agreement that the latter reach independently of
each other when they measure, classify or interpret the same phenomenon
(e.g. an electrocardiogram, a radiography, a pathological sample, etc.) to make a
decision, mainly a diagnosis (see Figure 4 for an example taking the moon as the
common phenomenon). Observer variability affects the reliability of data, that
is the extent to which the representations of the same or supposedly similar phenomena
are equivalent or at least consistent with each other. Both validity and reliability
jointly relate to the overall trustworthiness of data, but the former is much more
difficult to assess when a reference truth of which one is certain or
sufficiently confident is lacking (see Figure 3).
In medicine, not only can multiple raters classify the same phenomenon in
different ways, but the same doctor can also disagree with herself when examining the
same case after a certain amount of time, or under different environmental conditions
(e.g., with respect to workload, interruption rate, work shift). In the first case,
researchers speak of inter-rater agreement; in the latter case, of intra-rater agree-
ment. Examples abound. For instance, in [47] the authors report the case of plain
abdominal radiographies submitted to 3 different radiologists to detect the pres-
ence or absence of residual stone fragments: differences among radiologists were
found in 52% of the reports, and even by the same radiologist rereading the
films 24% of the time.
films 24% of the time. As another example, we mention the case of heart gallop
rhythm detection on cardiac auscultation, originally described almost 200 years
ago [64], an objective sign that is associated with the serious clinical syndrome
of heart failure. Despite the long lasting experience among clinicians all over the
11 Three further remarks: Our color-related example regards only different ways to
categorize colors, not to perceive them. Incidentally, the incidence of color blind-
ness among African people is approximately half the incidence among Caucasians.
Moreover, the fact that recognizing different shapes in the same scene can be traced
back to the known phenomenon of psychological pareidolia is irrelevant to our aims.
For us this is just a common-sense example of possible multiple interpretations of
the same phenomenon. Lastly, the careful reader will have noticed that the words
moon, month, measure and... medicine all derive from the same root *me-, Proto-
Indo-European for “to measure (appropriately)”.
Fig. 4. A, B and C illustrate a simple case of homomodality, where different appear-
ances of the moon are commonly classified with a single label: waxing moon. Moreover
an English speaking observer would denote A as purple, B as brown and C as blue.
An observer from the Ivory Coast speaking the local language Wobé would denote
all of them as ‘Kpe’. D and E show a case of symmodality, where the same observ-
able phenomenon (full moon) can be associated with two different labels: rabbit and
lady11.
world with this warning objective sign (also due to the widespread availability
and use of stethoscopes for cardiac auscultation among all medical disciplines),
it has been shown [58] that the agreement between expert or inexperienced ob-
servers and the phonocardiographic gold standard in the correct identification
of gallop rhythm is very poor, with overall inter-rater agreement turning out to be
little better than chance alone.
To account for the extent to which the majority decision (in case of multiple raters)
or the most frequently chosen category can be considered reliable, and hence
“true”, the so-called inter-rater reliability (IRR) is measured, as we will see in
the next section.
4 Between Gold Standards and Ghost Standards
A “gold standard” (or with a less evocative but more correct expression “criterion
standard”) is a reference method to ascertain medical truth or evidence. By
‘reference’ here we mean the ‘best one’ under reasonable conditions, that is the
method that ‘by definition’ is capable of establishing the so-called “ground truth”12
for any practical aim, including scientific research and ML-DSS development.
12 This is an expression borrowed from cartography, where it indicates information
that is acquired by direct observation (as opposed to inference, intelligence, reports,
maps, etc.) in an actual field check at a location.
However, the degree of truth that a gold standard usually reaches is far from
resembling the accurate, unambiguous and unique representation of medical facts
that computer scientists long for in their “ground-truthing”, i.e., the process of
gathering objective data to train a ML model.
In fact there are many possible gold standards, even for the same disorder,
ranging from autopsy (post-mortem) examination (in many cases the strongest
source) to the opinion of independent raters who may not have received strict indi-
cations on how to code what they observe and interpret (a much more
common situation). However, even in regard to those tests that are usually con-
sidered the most reliable and definitive gold standards, like post-mortem, his-
tological and genetic examinations, whenever there is a human factor (i.e., any
human actor involved in observing anything), variability, and hence uncertainty,
can emerge [87,86,7], as if the observers were called to observe and account
for phantom phenomena. In all of these cases, IRR scores can give an idea of
the extent to which the data that doctors collect, which glitter in medical datasets, are
golden or alloyed.
Medical researchers use several techniques to measure IRR: the most fre-
quently used is Cohen’s kappa, although this is applicable only to categorical
values assigned by two raters. Recently, Krippendorff’s alpha has also found
some application, probably because of its known advantages over the kappa, like the ca-
pability to address any number of raters (not just two), values of any level of
measurement (i.e., categorical, ordinal, interval, etc.), and datasets with missing
values, which are very common in medical records.
All these metrics are intended to assess the degree of agreement beyond chance
(that is, considering the fact that raters can agree not only because they believe
the same thing, but also by chance). As such, there is a lot of controversy about
their validity (lacking any model of how chance affects the raters’ decisions, let
alone of the different ways to misinterpret a phenomenon). Above all, there is little
consensus regarding how to interpret their scores. Medical scholars usually
rely on a convention from the 1970s by Landis and Koch [52], where the authors
indicated a fair agreement for reliabilities between 0.21 and 0.40, moderate agree-
ment when its measure is between 0.41 and 0.60, substantial agreement when
between 0.61 and 0.80, and almost perfect agreement above 0.81. This convention
became established over time and spread widely, although the authors themselves defined
the above divisions “clearly arbitrary” (p. 165). More recently, Krippendorff ad-
mitted that there is no set answer to the question “what is the acceptable level
of reliability?” [51] and that the answer could only “be related to the validity
requirements imposed on the research results, specifically to the costs of draw-
ing wrong conclusions [and whether] the analysis will affect someone’s life” (p.
325). Accordingly, and more conservatively than Landis and Koch, Krippendorff
suggested to “rely only on variables with reliabilities above .8, [. . . ] to consider
variables with reliabilities between .67 and .80 only for drawing tentative conclu-
sions [and furthermore warned that] “even a cutoff point of α = .80 – meaning
only 80% of the data are coded or transcribed to a degree better than chance – is
a pretty low standard by comparison to standards used in engineering, archi-
tecture, and medical research (p. 325, emphasis added). As a matter of fact,
medical data seldom reach such a level of agreement.
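As a purely illustrative sketch of how such scores are computed and read, the following Python fragment applies Cohen’s kappa (as implemented in the scikit-learn library) to two invented raters and maps the result onto the Landis and Koch bands recalled above; Krippendorff’s alpha, which would require a dedicated library, is not shown.

from sklearn.metrics import cohen_kappa_score

# Invented diagnostic labels assigned by two raters to the same eight cases.
rater_a = ["angina", "no angina", "angina", "angina", "no angina", "angina", "no angina", "angina"]
rater_b = ["angina", "angina", "angina", "no angina", "no angina", "angina", "no angina", "no angina"]

kappa = cohen_kappa_score(rater_a, rater_b)

# Landis and Koch's (admittedly arbitrary) interpretation bands.
bands = [(0.21, "slight or poor"), (0.41, "fair"), (0.61, "moderate"),
         (0.81, "substantial"), (1.01, "almost perfect")]
label = next(name for upper, name in bands if kappa < upper)

# Raw agreement is 5 cases out of 8 (62.5%), yet kappa is only 0.25 ("fair"),
# because part of that agreement is expected by chance alone.
print(f"Cohen's kappa = {kappa:.2f} ({label} agreement)")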
To give some concrete examples of this last point, let us consider the fields
where ML-DSS have recently reached levels of diagnostic accuracy (at least)
on a par with human doctors [37]. Even in this case, the reported IRR scores
for the corresponding reference standards are quite low in light of the above
recommendations. Here we mention two cases: diabetic retinopathy, which is
the cause of 1 case of blindness out of 10; and skin cancer, which results in ap-
proximately 80,000 deaths a year and is the most common form of cancer in
many Western countries. In the former case [37], a convolutional neural network
was recently trained and then evaluated in regard to the detection of diabetic
retinopathy in a wide dataset of retinal fundus photographs. In this study, the
authors reported high levels of both sensitivity and specificity for their ML-DSS
according to the gold standard that they decided to adopt, i.e. the majority
decision of a panel of board-certified ophthalmologists analyzing the same reti-
nal fundus photographs. As a matter of fact, some authors reported that the
adoption of this gold standard can be considered controversial [92], since this
may compare unfavorably with other gold standards used in previous studies on
diabetic retinopathy (i.e. standardized centralized assessment of images or opti-
cal coherence tomography). In fact, the prevalence of diabetic retinopathy may
vary significantly depending on whether this condition is evaluated through monocular fundus
photographs or, rather, through optical coherence tomography [89]. This could
turn out to be relevant since diagnostic accuracy metrics are dependent on the
prevalence of diseases according to Bayes’ theorem. Thus two questions are at
stake here. On the one hand, whether similar successful results would have been
obtained if a different gold standard (e.g., optical coherence tomography) had
been used. On the other hand, even if we assume eye (fundus) photographs as
an indisputable, unique and reliable gold standard for the diagnosis of diabetic
retinopathy, IRR among ophthalmic care providers has been shown to be very
low (i.e., kappa between 0.27 and 0.34 for different diagnostic analyses); and
still inadequate, even if higher, among retina specialists (kappa between 0.58
and 0.63) [75]. Another recent successful use of ML in diagnostics refers to the
high accuracy exhibited by a convolutional neural network aimed at diagnosing
skin cancers by differentiating those cases from benign skin lesions [26]. In this
case, the performance of the ML-DSS was evaluated against 21 dermatologists
on biopsy-proven clinical images. Biopsy results are histological examinations,
and are therefore considered the most reliable and “standard” gold standard. De-
spite this common assumption, far from optimal IRR scores have been observed
among dermatopathologists considering the histological diagnosis of clinically dif-
ficult cutaneous lesions, with kappa values ranging from 0.31 to 0.80 according
to different diagnostic analyses [7].
5 Garbage in, Gospel out
The question of the quality of the medical record and of the data extracted from it
is still understudied [81,10], let alone in regard to machine learning projects [27].
The assumption that medical data could support secondary uses has been chal-
lenged for almost 25 years, and strongly so, e.g., by Reiser [71], who
described several cases of erroneous, missing and ambiguous data, and by Bur-
num [8], who provocatively wrote that “all medical record information should be
regarded as suspect; much of it is fiction” (p. 484). Burnum even added that
the introduction of health information technology had not led to improvements
in the quality of medical data recorded therein, but rather to the recording of
a greater quantity of “bad data”13. In those same years, van der Lei was among
the first to warn against the reuse of clinical data for objectives other
than care, and proposed what has since been known as the first law of informatics:
‘[d]ata shall be used only for the purpose for which they were collected’ [54].
In light of the phenomena of both low quality and uncertainty that are intrin-
sic to the production of medical data14, what are the main implications for the
machines that are fed with this information? As is widely known, many factors can
contribute to degrading the performance of a ML-DSS. Just to mention a few
that we observed in the hospital domain: medical data seldom meet
the common assumptions that training data should ideally possess. For instance,
their attributes are seldom independent and identically distributed (IID); their
distribution is neither uniform nor normal; missing data do not occur randomly
(in fact they often indicate an either good or just better health condition that
relieves practitioners from the need to record it with continuous effort, see be-
low); data can be strongly unbalanced with respect to the number of healthy and
positive cases, or to the real prevalence of a pathological condition; they are not
temporally stable (for instance, computer interpretations of electrocardiograms
recorded just one minute apart were found to be significantly different in 4 of 10 cases
in [80]); and they can fall short of representing the target population (sampling bias)
or fail to make explicit any potential confounding variable (especially those related
to “external” medical interventions [68]).
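The effect of class unbalance alone, one of the issues just listed, can be illustrated with a small synthetic sketch (prevalence and sample size are invented): a classifier that always predicts the majority class appears highly “accurate” while missing every pathological case.

import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

rng = np.random.default_rng(1)
n = 1000
y = (rng.random(n) < 0.05).astype(int)   # 5% prevalence of the pathological condition (invented)
X = rng.normal(size=(n, 3))              # features that carry no signal at all

clf = DummyClassifier(strategy="most_frequent").fit(X, y)
y_hat = clf.predict(X)

print("accuracy:", accuracy_score(y, y_hat))     # about 0.95, despite learning nothing
print("sensitivity:", recall_score(y, y_hat))    # 0.0: every pathological case is missed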
In this view, cases misclassified through information bias could be seen as just
another issue to cope with (although one of the most serious ones [29]). However,
our point is that considering misclassification only as a defect of data collection is
a conceptual error as long as it is considered a mis-classification: as we saw above,
it should be more properly considered as a classification where independent
observers disagree and classify the same phenomenon differently, to the best of
their competence, perspicacity and perceptual acuity.
13 Moreover, Burnum traced back this lie of the land to “standards of care and a
reimbursement system [that is] blind to biologic diversity”.
14 The reader should notice that low quality can be assessed only ex-post, as informa-
tion bias, by comparison with an unbiased (if any) gold standard. By assuming a
certain amount of bias in medical data, and by measuring actual observer variability,
available medical data can be considered uncertain ex-ante.
This variability is often neglected even by doctors, and few studies indulge in
reporting low IRR scores (also because this kind of study is believed to disturb
the doctors’ morale [72, p. 193]). No wonder, then, that the related uncertainty is
dispelled as close as possible to the “source”, as also the official guidelines for
medical coding and reporting15 ratify in an explicit way: “If the diagnosis doc-
umented at the time of discharge is qualified as ‘probable’, ‘suspected’, ‘likely’,
‘questionable’, ‘possible’, or ‘still to be ruled out’, or other similar terms indi-
cating uncertainty, code the condition as if it existed or was established” [23,
p.90]. Alternatively, uncertainty is sublimated in the (statistically significant)
consensus of a sufficiently wide group of experts [82].
Lastly, adopting different gold standards can affect ML-DSS significantly.
We illustrate this by mentioning the case of Carpal Tunnel Syndrome (CTS):
this is a kind of functional hand impairment that is frequently observed and
is due to the compression of the median nerve at the wrist. This syndrome is
commonly diagnosed, and often referred to surgical treatment, through two dif-
ferent gold standards depending on which specialist suspects it: the sole
physical examination by orthopedic surgeons, or a nerve conduction examination
(electromyography) by neurologists [4]. In recent years, alternative diagnostic
methods have been proposed to improve diagnostic accuracy for CTS, like the
ultrasonography of the median nerve of the arm. These tests have shown different
results in accuracy metrics when compared to the current standards mentioned
above (i.e. physical examination or electromyography) [4]. These diagnostic di-
vergences, if neglected in the training of a ML classifier aimed at the diagnosis
of CTS [59], may result in the ossification into the model of an arbitrarily partial
version of the ground truth (that is whether patient X is really affected by CTS
or this syndrome can be ruled out) and hence lead to unpredictable downstream clin-
ical consequences. For instance, it has been observed that a number of patients
diagnosed with CTS who had undergone surgery did not receive any relevant
benefit from the invasive treatment, and that this could be explained in terms
of wrong upstream diagnoses [33].
A ML-DSS that has learned the uncertain (i.e., right for one standard, wrong
for another) mapping between the patient’s features and one single diagnosis will
propose its advice within a dangerous “closed-world assumption” (that is: all the
relevant features have been considered and the mapping between the input and
the desired output is assumed to be accurate and reliable), which is never challenged
by design. In other words, the model could suffer from algorithmic bias, that is
discriminate patients according to an arbitrary preference for a gold standard
over an alternative one16.
15 (International Classification of Diseases, Ninth Revision, Clinical Modification,
ICD9-CM).
16 As said in Section 2, algorithmic bias regards any assumption or heuristics that is
adopted to make prediction faster or easier. In the ML literature, this latter term has
recently acquired a more human flavor, indicating the extent to which a classifier can reiterate
or even exacerbate the typical discriminations affecting human classification or deci-
sion, like those based on gender, race, religion, health and income [39]. Algorithmic
bias has then acquired the sense of a factor explaining any ML classification that,
On the other hand, in the open world of hospital wards physicians are used to
observer variability and less-than-perfect gold standards whenever they consult
the medical data that are produced by their colleagues (and even by themselves in
other work shifts, cf. the concept of intra-rater agreement). Conversely, designers
of computational systems usually do not consider the case that the input of
their system can be inherently and irremediably biased and inaccurate (to some
extent); rather, they assume it to be true. The primary concerns of ML-DSS designers are the
completeness, timeliness and consistency [10] of the datasets that they feed into
the machines. There is little (if any) recognition that medical data may be no
better than “dirty” data, with which hoping to optimize a ML model
adequately would be highly optimistic if not over-ambitious. Critical thinking
would then suggest looking with some caution at the high accuracy rates that
are often reported in the specialist literature [37,26], even assuming that model
overfitting [24] has been duly avoided. In fact, training data could be “good”
with respect to a gold standard, but dubious according to an alternative gold
standard.
This potential divergence is hardly considered when medical data are taken
from the context where they have been natively produced to support coordina-
tion, knowledge sharing, sense making and decision making, and are trans-
formed into data sets to feed into some computational system [12]. Neglecting
the gap between the primary use of medical data (i.e., care) and any secondary
use (e.g., ML-DSS training) could mislead those who have to design trustworthy
decision support systems, and also probably jeopardize the actual improvement
of the ML-DSS performance on new and real data other than the training data.
This points to the difference between research data, which are usually used
for ML-DSS training and optimization; and real-world data, which are produced
in real-life clinical situations. While research data are not made up on purpose to
get high accuracy, they are nevertheless selected, cleansed, and engineered [91] to
an extent that is completely unrealistic or unfeasible to replicate in actual clin-
ical settings (e.g., like in [55]). This is not only a matter of generalizability and
interpretability of the model. It is also a matter of different ways to evaluate the
ML-DSS performance. The most common one can be considered essentialist [13],
in that it focuses on accuracy and other performance measures (like F1-scores,
and AUC-ROC) that are appraised in a laboratory setting (i.e., in laboratorio).
An alternative and still neglected approach, which we can denote as consequen-
tialist, focuses on the actual consequences (i.e. health outcomes) produced in
situated practice (i.e., in labore), that is in the original context of work of the
physicians involved and in their actual relationship with patients, when deci-
if made by a human, would likely denote unfair inclination or prejudice. However,
technically speaking what now is considered a form of algorithmic bias would not
even be considered a bias according to Mitchell’s definition (see Section 2), as
long as the basis for a discriminatory prediction is actually latent and implicit in
the training data themselves. For this latter kind of algorithmic bias, the phrase
“machine bias” could perhaps be preferable, as in [3], to avoid potentially confusing
homonyms, its vagueness notwithstanding.
sions must be converted into real-life choices that must align with the patients’
attitudes, preferences, fears and hopes (as well as the economic feasibility of the
available options).
6 Embracing uncertainty, also in computation
As hinted above, there are many types of uncertainty in medicine, which affect
medical records in different ways. For a certain attribute (i.e., variable) that is
pertinent for a certain case, users may not know which value is applicable, let alone
true; or which single value is true among a finite set of values that are known to be
equally applicable. Users could be uncertain between two values from the above
set, or among many. Moreover, they could prefer some options with respect to
others. If single users are certain about a value, they could nevertheless disagree
among each other (and even with themselves over time). Ultimately, they could
be uncertain among different values at various levels of confidence with respect
to each other (e.g., in a dichotomous domain, which is the simplest, doctor A is
fairly certain that the condition is pathological, doctor B is strongly certain).
As shown by Svensson et al. in [82], the performance of ML-DSS is negatively
impacted and deeply undermined when they are fed with medical datasets that
are intrinsically uncertain. Their idea is to employ conventional statistical tests
to reduce the variability of the data produced by different observers by choosing
the values that have been proposed by a statistically significant majority of the
observers (e.g., 9 experts out of 12). However effective, this could be also seen
as a way to discard the richness of a multi-value representation that accounts
for a manifold phenomenon, which competent and skillful observers can describe
each in her own, and partially sound yet specifically irreproachable, way. With
reference to the example above, are the 3 dissenting experts simply wrong, or do they
perhaps see things that the others cannot?
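As a back-of-the-envelope illustration of the “statistically significant majority” idea (not necessarily the exact test adopted by Svensson et al.), one can ask how likely a 9-to-3 split would be if each of 12 independent raters chose between two labels purely by chance:

from scipy.stats import binomtest

# Null hypothesis: each of 12 raters picks either label with probability 0.5.
result = binomtest(k=9, n=12, p=0.5, alternative="greater")
print(f"P(at least 9 of 12 agree by chance) = {result.pvalue:.3f}")
# About 0.073: not significant at the usual 5% level, so even a 9-to-3 majority
# does not, by itself, license dismissing the three dissenting experts.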
Thus, if we take the “dirtiness” and “manifoldness” (seen as sides of the same
coin) of medical data as a given factor of medicine, one could wonder: how can
ML techniques take these constraints seriously, and even exploit them to get
a richer picture of the phenomena of interest? How could ML researchers who
are deeply aware of the medical complexities build models that could support
human experts in their daily, and uncertain, practice?
As said in Section 1, different kinds of uncertainty can be considered. Also,
different names can be used to denote different (and similar) kinds of uncertainty.
Indeed, if we look at the different classifications and taxonomies that have been
proposed to describe uncertainty [66], we can find a long list of terms, like:
Absence, Ambiguity, Approximation, Belief, Conflict, Confusion, Fuzziness, Im-
precision, Inaccuracy, Incompleteness, Inconsistency, Incorrectness, Irrelevance,
Likelihood, Non-specificity, Probability, Randomness, just to mention a few.
Each kind of uncertainty can be addressed with specific conceptual tools
to represent and manage it: probability theory, fuzzy sets, possibility theory,
evidence theory, rough sets, just to mention a few also in this case. By and large,
the predominant role is played by probability theory, and machine learning is
not an exception in this attitude. However, there exist solutions (in some cases
preliminary attempts) to incorporate other tools in machine learning (see for in-
stance [44,45,20,5,88]). There are several reasons why these approaches are not
well established in ML, as widely discussed in [45] for the fuzzy set case: some-
times the new tools are naive; there is little communication and cross-fertilization
among different scholarly communities; and there is a problem of credibility for many
recent disciplines, which are not yet as well established as probability the-
ory already is. However, if we want to deal with all of the different forms of
uncertainty that we recognize as affecting medical data, a specific and direct way
to manage each of them, also in ML design and optimization, is preferable to a
one-size-fits-all approach. In what follows, we give some ideas on how to proceed,
making reference to the biases discussed in the previous sections of the paper.
First, let us consider the problem of representing a rater’s reliability. It is
commonly assumed that ratings are exact, though they may be classified either
as “deterministic” or “random”. Here “random” means that “the rater is uncer-
tain about the response category” [38]. More than a question of randomness,
this description points to a form of epistemic uncertainty which can be handled
by not assuming exactness, but rather introducing a sort of graduality on the
judgment scale of a rater. For instance, we could have three levels of certainty
(i.e., low/fair/good) on the assigned rating and/or the rater can express her
uncertainty by selecting more than one category with its own level of certainty.
For instance, a physician could affirm, with a good level of confidence, that
a patient suffers from disease A. Another one can be undecided between dis-
ease A and disease B with a low certainty on both. This kind of uncertainty
can be applied also to the input data (symptoms) and we can represent the
fact that a patient perhaps suffers from headache and surely suffers from nausea.
This situation can be handled with Possibility Theory and, in particular, with
its simplified form of certainty-based model [69], which is more interpretable
and simple from a computational standpoint. As an example, let us consider
the data in Table 1. The attribute “bicuspid aortic valve”, which is dichotomic
(yes/no), is associated with a three-way expression of certainty (namely, confi-
dent/highly confident/sure). We also notice that these expressions are ordered,
i.e., confident < highly confident < sure. Similarly, the attribute “mitral regurgita-
tion” is uncertain with respect to the value to be assigned to patients P5 and P6.
This uncertainty is expressed through two alternative options (thus excluding
all the other values). Of course, ML-DSS have to be modified to comply with
this model, and some steps in this direction already exist [43,40,17].
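A minimal sketch of how such qualified and possibly multi-valued ratings could be represented is given below (in Python); the ordering of the certainty levels mirrors Table 1, while the record shown and the helper names are illustrative assumptions of ours, not the certainty-based model of [69] itself.

from dataclasses import dataclass

CERTAINTY_ORDER = {"confident": 1, "highly confident": 2, "sure": 3}

@dataclass
class QualifiedValue:
    value: str        # e.g. "yes", "heart failure", "severe"
    certainty: str    # one of the keys of CERTAINTY_ORDER

    def at_least(self, level: str) -> bool:
        """True if the rating is at least as certain as the given level."""
        return CERTAINTY_ORDER[self.certainty] >= CERTAINTY_ORDER[level]

# Patient P4 of Table 1, kept uncertain instead of being forced into a single label.
p4 = {
    "bicuspid_aortic_valve": QualifiedValue("yes", "highly confident"),
    "acute_dyspnea": [QualifiedValue("heart failure", "highly confident"),
                      QualifiedValue("pneumonia", "confident")],
}

print(p4["bicuspid_aortic_valve"].at_least("highly confident"))   # True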
A related (but different) problem is the possibility of expressing the nuances of
a medical condition. For instance, a patient could suffer (with certainty) from
mild headache and strong nausea or, with reference to Table 1, a severe mitral
regurgitation. We notice that in a dichotomous situation we would be forced
to say no headache, yes nausea and yes mitral regurgitation17. Moreover, these
nuances could be further differentiated by expressing the extent to which a
17 Let us emphasize that this is different from saying that headache has a probability
of 0.2 and nausea a probability of 0.8: probability implies the fact that we are more
or less sure on a dichotomic choice, while it does not represent any actual nuance of
either clinical signs or symptoms.
Patient | Mitral regurgitation | Acute dyspnea | Bicuspid aortic valve | EKG stress test
P1 | No | Heart failure (highly confident) | Yes (sure) | Positive (highly confident)
P2 | Mild | COPD (highly confident) | No (confident) | Negative (highly confident)
P3 | Moderate | Pneumonia (highly confident) | No (confident) | Not performed
P4 | Severe | Heart failure (highly confident) or pneumonia (low confident) | Yes (highly confident) | Negative (confident)
P5 | Moderate or Severe | Heart failure (highly confident) or pneumonia (highly confident) | Undetermined | Undetermined
P6 | Mild or Moderate | COPD or pulmonary embolism or pneumonia | Not applicable | Not available
P7 | Undetermined | Heart failure and pneumonia | Not applicable | Positive (confident)
Table 1. An exemplificatory dataset extracted from the field of work that has not
undergone any post-processing or data cleansing task. Not available means a missing
value that nevertheless should be represented. Not applicable denotes a structurally
missing value. COPD stands for ‘Chronic Obstructive Pulmonary Disease’.
condition is, say, low, high, or severe, in terms of degrees or normalized scores.
This kind of information can be modeled by fuzzy sets [44,45] as exemplified in
Figure 5 for the attribute fever.
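A minimal sketch of the linguistic variable of Figure 5 follows; the membership functions are trapezoidal and the breakpoints (in degrees Celsius) are invented for illustration only.

def trapezoid(x, a, b, c, d):
    """Membership degree of x for a trapezoidal fuzzy set with breakpoints a <= b <= c <= d."""
    if x <= a or x >= d:
        return 0.0
    if b <= x <= c:
        return 1.0
    return (x - a) / (b - a) if x < b else (d - x) / (d - c)

fever = {
    "very low": lambda t: trapezoid(t, 34.0, 34.0, 35.5, 36.2),
    "low":      lambda t: trapezoid(t, 35.5, 36.2, 36.8, 37.5),
    "medium":   lambda t: trapezoid(t, 36.8, 37.3, 38.2, 38.8),
    "high":     lambda t: trapezoid(t, 38.2, 38.8, 42.0, 42.0),
}

t = 37.2
print({label: round(mu(t), 2) for label, mu in fever.items()})
# A temperature can belong to "low" and "medium" at the same time, with degrees
# that, unlike probabilities, need not sum to 1.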
The above two approaches of possibility theory and fuzzy sets are adequate to
model uncertainty in the input data of the ML-DSS, uncertainty that can then
be transferred to the output. On the other hand, three-way decision aims
to produce an uncertain output (given either a certain or an uncertain input)
[94,95]. In this framework, objects are classified according to three levels instead
of the two of standard decision theory. In the present context, these three levels
can have different interpretations, like ill/healthy/unknown, disease-A/disease-
B/undecided. Clearly, the newly introduced third level (unknown/undecided) ex-
presses an indecision on the outcome, which may be addressed by making new
investigations on the patient at hand.
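The following toy rule (in Python) illustrates the three-way idea: a case is assigned to disease A, to disease B, or deferred as undecided according to two thresholds; the thresholds here are arbitrary, whereas in three-way decision theory they would be derived from the costs of the different kinds of error.

def three_way(p_disease_a: float, accept: float = 0.75, reject: float = 0.25) -> str:
    """Return a three-valued decision instead of forcing a binary one."""
    if p_disease_a >= accept:
        return "disease A"
    if p_disease_a <= reject:
        return "disease B"
    return "undecided: further investigation needed"

for p in (0.9, 0.5, 0.1):
    print(p, "->", three_way(p))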
As a further problem, let us consider numerical information, such as that
obtained by some measurement. Of course, any point value carries some
degree of imprecision: this can be due, e.g., to the instrument itself, or related to
the average value among repeated measures. Thus, we can consider representing
Fig. 5. A typical representation of a so-called linguistic variable, fever in this case,
with four values: very low, low, medium and high, each represented by a fuzzy set. The
abscissa reports the measured temperature of patients, the ordinate the membership
degree of patients in the four fuzzy sets. The reader should notice that the boundaries
between the fuzzy sets do not necessarily have to be symmetrical and that, given an
abscissa (i.e. a temperature value) the sum of the ordinates (i.e., membership degrees) must
not necessarily equal 1.
the information in terms of intervals and work with them natively. To this aim,
interval arithmetic [70] and fuzzy arithmetic [57] provide the formal instruments
to operate with this kind of representation.
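A minimal sketch of the interval idea, with invented figures for the measurement error, is the following (a toy class, not the full interval arithmetic of [70]):

from dataclasses import dataclass

@dataclass
class Interval:
    low: float
    high: float

    def __add__(self, other: "Interval") -> "Interval":
        return Interval(self.low + other.low, self.high + other.high)

    def __mul__(self, k: float) -> "Interval":
        lo, hi = self.low * k, self.high * k
        return Interval(min(lo, hi), max(lo, hi))

# Two repeated systolic blood-pressure readings, each with an assumed +/- 3 mmHg device error.
reading_1 = Interval(139 - 3, 139 + 3)
reading_2 = Interval(147 - 3, 147 + 3)
mean_reading = (reading_1 + reading_2) * 0.5
print(mean_reading)   # Interval(low=140.0, high=146.0): the imprecision is carried along, not hidden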
Finally, it is well known that data often come with missing values. A standard approach in ML is to impute them in order to obtain a complete dataset, since several techniques require completeness to work properly. Imputation, however, "corrupts" the original information, as some spurious values are necessarily introduced. We should therefore not simply get rid of missing values, but rather consider that a missing value can have several meanings: ignorance of the real value; intrinsic non-existence; or even, quite simply, that "all is well" (or "nothing new happened") and the clinicians did not deem it necessary to produce a new datum [62].

An example of the "hidden meaning" of missing values has recently been highlighted in a study on cardiovascular risk prediction, where several ML models exhibited a higher predictive accuracy than a traditional risk model (the American Heart Association/American College of Cardiology risk prediction algorithm) [90]. In this study, body mass index (BMI), which was included among the variables tested for cardiovascular risk prediction, had many missing values, and "missing BMI" turned out to be a variable selected by the deep learning model as predictive of a lower risk of cardiovascular events, i.e. it was selected as a protective variable. This is consistent with the post-hoc acknowledgment that many physicians tend not to record the BMI of patients believed to be at lower global cardiovascular risk, especially when this index is likely to be in the normal range (i.e., the patient is not clearly overweight or obese).

Rough set theory includes rule induction methods that work in the presence of missing values and that can accommodate their different meanings. In particular, we point the reader to the works by Grzymala-Busse and his MLEM algorithm (see for instance [36,35]).
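As a minimal illustration of the alternative hinted at here, the sketch below keeps an explicit "missingness" indicator for BMI alongside (rather than instead of) imputation, so that a downstream model could learn that the very absence of the value is informative; the data and column names are invented for illustration only.

```python
# Sketch: keep the fact that BMI is missing as an explicit feature,
# rather than silently replacing it with an imputed value.
# Data and column names are illustrative only.

import pandas as pd

records = pd.DataFrame({
    "age":    [54, 61, 47, 70],
    "bmi":    [27.3, None, None, 31.8],   # None = BMI was never recorded
    "smoker": [1, 0, 0, 1],
})

# 1) Explicit missingness indicator: the absence of BMI becomes a feature in itself.
records["bmi_missing"] = records["bmi"].isna().astype(int)

# 2) If the downstream learner requires complete data, impute *after* having
#    preserved the missingness information.
records["bmi_imputed"] = records["bmi"].fillna(records["bmi"].median())

print(records)
```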
7 Conclusions
Machines can seem so accurate, so right. They can make us forget
who made them, and who designed into them – with all the
possibilities of human frailty and error – the programs that dictate
their function. They can make us forget the hands and minds
behind their creation; they can make us forget ourselves.
Stanley Joel Reiser [73]
Fox [28], in her seminal work on the sociology of medical knowledge, once wrote that uncertainty has become the hallmark of the entire field of medicine. For this reason, confronting uncertainty has been the first and foremost driver for the introduction of computational Decision Support Systems (DSS) in medicine and for their increasingly wide adoption in clinical settings [67]. We can only speculate on why medicine has turned to technology to "make sense of health data"18 (to cite a position paper published in Nature a couple of years ago [25]). Quite subtly, Katz [49] has argued that the traditional mechanisms that physicians adopt to cope with uncertainty (e.g., terminological standards, standard care protocols, guidelines based on statistical studies) could have had a role in slowly pushing them towards disregarding or even opposing uncertainty.

18 We cannot exclude that this is partly due to the greater expectations of patients, who perhaps were more fatalistic in the past and are now rather consumers of health services and demanding clients of medical consultants; but, more paradoxically, reading pieces like the Nature one, it looks as if technology itself created the problem, by making available more data to "sift through and find answers to questions about health" (ibidem), a problem that it is now called to solve, e.g., with artificial intelligence softbots that crawl among millions of articles to suggest the best scientific evidence available for the case at hand.
Irrespective of the root causes of this situation, the digitization of medicine has contributed to shifting the idea of uncertainty: from a natural and irreducible element of medical practice [74,79], bound to the interpretation of subtle and sometimes contradictory clues in the existential and complex context of idiosyncratic patients, to one of those rational problems that can be modeled in order to pursue an engineering, or even computational, solution.
In this paper we have briefly explored the blurring boundaries between what computer scientists and medical doctors look for in medical data: data accuracy and completeness for the former [81]; trustworthiness and meaningfulness for the latter [10]. We have also shed light on information bias and observer variability, which separate us from any absolutely true, universal and reliable representation of a physical (let alone psychological or mental) phenomenon. In particular, for ML designers we have pointed out that information bias does not concern only the labelling of the data set, i.e., the information on which an ML-DSS is trained to predict other labels accurately; it also (and above all) affects the whole input data, in both training and prediction, especially with regard to nominal and ordinal variables.
In light of these different viewpoints, we outline a couple of recommendations along the general framework by Domingos, who conceives ML problems as a combination of representation, optimization and evaluation [24]. From the representation perspective, computer scientists should not settle for "polished data" but rather "get to the source" of medical data: the multiple, possibly divergent opinions of experts. This means being wary of studies where the gold standard is not reported, or where it is a dataset annotated by a single physician or just a couple of them (especially if kappa or similar agreement scores are below .8). Moreover, if the adopted gold standard is based on a consensus reconciliation of divergent opinions, the authors of those studies should also be aware that, in doing so, they treat all of these divergences as plain mistakes. If they are less than certain that this is fair, they should offer a word of caution on the potential arbitrariness of the clear-cut classification they have used in their study. In the study design phase, authors could also ask the competent observers for the degree of self-perceived confidence with which they share their opinions and produce their data. This ordinal scale could be used to weight the multiple values of a single representation, so that the ML algorithms can again leverage the knowledge of the domain experts to build a coherently fuzzy representation. Furthermore, they could annotate the representativeness of each value in terms of a three-way partitioning (e.g., belonging to the majority opinion, belonging to the minority, or belonging to neither with statistical significance, as done in [11]). In any case, ML researchers should always report how they collected their ground truth, and be explicit about the gold standard they relied on.
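The following sketch illustrates two of the checks suggested above: computing a chance-corrected agreement score (Cohen's kappa) between two annotators before promoting their labels to a gold standard, and turning several divergent, confidence-rated opinions into a soft label instead of a single crisp one. The labels, confidence values and the plain-Python implementation are illustrative assumptions of ours.

```python
# Sketch: (1) measure how much two annotators actually agree before trusting
# their labels as a gold standard; (2) turn several divergent, confidence-rated
# opinions into a soft label instead of a single crisp one.
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Plain implementation of Cohen's kappa for two raters on nominal labels."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in set(labels_a) | set(labels_b)) / n ** 2
    return (observed - expected) / (1 - expected)

rater1 = ["pneumonia", "heart failure", "pneumonia", "COPD", "heart failure"]
rater2 = ["pneumonia", "pneumonia",     "pneumonia", "COPD", "heart failure"]
kappa = cohen_kappa(rater1, rater2)
print(f"kappa = {kappa:.2f}",
      "(below 0.8: be wary of using these labels as ground truth)" if kappa < 0.8 else "")

# Confidence-weighted soft label for one case annotated by three experts.
opinions = [("heart failure", 0.9), ("heart failure", 0.6), ("pneumonia", 0.7)]
total = sum(w for _, w in opinions)
soft_label = {c: sum(w for lab, w in opinions if lab == c) / total
              for c in {lab for lab, _ in opinions}}
print(soft_label)   # heart failure ~0.68, pneumonia ~0.32
```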
In regard to optimization, further research should be devoted to transferring techniques and methods from the rough set theory domain (e.g., [85]) into the ML arena, as seen above. In regard to evaluation, the ball could be passed back to the medical practitioners. They should develop a wariness of any essentialist evaluation of ML-DSS performance that is carried out in laboratorio and on research data. Rather, they should demand from the ML-DSS designers (and their advocates) evidence-based validations of their systems [18], and adopt them only once some further information has been given about, e.g., the size of the diagnostic improvement detected; the trade-off between specificity (avoiding false positives, i.e., overtesting and overtreatment) and sensitivity (avoiding false negatives, i.e., failing to treat and cure); the trade-off between the internal (i.e., bias) and external (i.e., variance) validity of the model (including the extent to which the ML-DSS can fit multimorbid cases, instead of being excessively specialized for one disease); and the trade-off between its predictive power and its interpretability [15], that is, its scrutability by doctors and lay users who want to understand why the ML-DSS has suggested a certain decision over other possible ones [56], and who need to make the "hybrid" agency of man-and-machine more accountable towards colleagues, patients and their dear ones.
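As a toy illustration of the specificity/sensitivity trade-off mentioned above, the following sketch computes both measures for the same hypothetical model operated at two different decision thresholds; the counts are invented for illustration only.

```python
# Toy computation of the sensitivity/specificity trade-off from a confusion matrix.
# The counts are invented for illustration only.

def sensitivity_specificity(tp, fn, tn, fp):
    sensitivity = tp / (tp + fn)   # avoiding false negatives (failing to treat and cure)
    specificity = tn / (tn + fp)   # avoiding false positives (overtesting and overtreatment)
    return sensitivity, specificity

# The same hypothetical model at two decision thresholds: raising the threshold
# trades sensitivity for specificity, and vice versa.
scenarios = {"low threshold": (95, 5, 700, 200), "high threshold": (70, 30, 880, 20)}
for name, (tp, fn, tn, fp) in scenarios.items():
    sens, spec = sensitivity_specificity(tp, fn, tn, fp)
    print(f"{name}: sensitivity={sens:.2f}, specificity={spec:.2f}")
```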
Even more than that, ML-DSS should be the object of a value-based assessment, in which researchers invest time and effort in evaluating their systems in the mid and long term after deployment in real settings, and in which their appraisal is conducted in terms of user and patient satisfaction, of effect size on clinical outcomes, and eventually of cost reduction or better service provision. None of these elements should be overlooked or taken for granted, especially in light of the perils of automation bias (such as deskilling, technology overreliance and overdependence [31]), not least the surreptitious increase of doctors' trust in numbers and "objective facts" (cf. the McNamara fallacy) that the reckless application of machine learning, in response to an excessive human yearning for certainty, could bring about, especially in fields where such certainty is likely to be only a dream of ignorance.
References
1. Althubaiti, A.: Information bias in health research: definition, pitfalls, and adjust-
ment methods. Journal of multidisciplinary healthcare 9, 211 (2016)
2. Andrews, J.E., Richesson, R.L., Krischer, J.: Variation of SNOMED CT coding of
clinical research concepts among coding experts. Journal of the American Medical
Informatics Association 14(4), 497–506 (2007)
3. Angwin, J., Larson, J., Mattu, S., Kirchner, L.: Machine bias: There’s software
used across the country to predict future criminals. and it’s biased against blacks.
ProPublica, May 23 (2016)
4. Bachmann, L.M., Jüni, P., Reichenbach, S., Ziswiler, H.R., Kessels, A.G., Vögelin,
E.: Consequences of different diagnostic ‘gold standards’ in test accuracy research:
Carpal tunnel syndrome as an example. International journal of epidemiology
34(4), 953–955 (2005)
5. Bello, R., Falcon, R.: Rough Sets in Machine Learning: A Review, pp. 87–118.
Springer International Publishing, Cham (2017)
6. Bowker, G.C., Star, S.L.: Sorting things out: Classification and its consequences.
MIT press (2000)
7. Braun, R., Gutkowicz-Krusin, D., Rabinovitz, H., Cognetta, A., Hofmann-
Wellenhof, R., Ahlgrimm-Siess, V., Polsky, D., Oliviero, M., Kolm, I., Googe, P.,
et al.: Agreement of dermatopathologists in the evaluation of clinically difficult
melanocytic lesions: how golden is the ‘gold standard’? Dermatology 224(1), 51–
58 (2012)
8. Burnum, J.F.: The misinformation era: the fall of the medical record. Annals of
Internal Medicine 110(6), 482–484 (1989)
9. Cabitza, F.: Breeding electric zebras in the fields of Medicine. ArXiv e-prints
(2017), https://arxiv.org/abs/1701.04077
10. Cabitza, F., Batini, C.: Information quality in healthcare. In: Data and Information
Quality, chap. 13, pp. 421–438. Springer (2016)
11. Cabitza, F., Ciucci, D., Locoro, A.: Exploiting collective knowledge with three-way
decision theory: Cases from the questionnaire-based research. International Journal
of Approximate Reasoning 83, 356–370 (2017)
12. Cabitza, F., Locoro, A.: Human-data interaction in healthcare: Acknowledging use-
related chasms to design for a better health information. In: the Procs of the 8th
IADIS International Conference on e-Health (2016)
13. Cappelletti, P.: Appropriateness of diagnostics tests. International journal of lab-
oratory hematology 38(S1), 91–99 (2016)
14. Carey, I., Nightingale, C., DeWilde, S., Harris, T., Whincup, P., Cook, D.: Blood
pressure recording bias during a period when the quality and outcomes framework
was introduced. Journal of human hypertension 23(11), 764 (2009)
15. Caruana, R., Lou, Y., Gehrke, J., Koch, P., Sturm, M., Elhadad, N.: Intelligible
models for healthcare: Predicting pneumonia risk and hospital 30-day readmission.
In: Proceedings of the 21th ACM SIGKDD International Conference on Knowledge
Discovery and Data Mining. pp. 1721–1730. ACM (2015)
16. Chockley, K., Emanuel, E.: The end of radiology? three threats to the future prac-
tice of radiology. Journal of the American College of Radiology 13(12), 1415–1420
(2016)
17. Ciucci, D., Forcati, I.: Certainty-based rough sets. In: Rough Sets - International
Joint Conference, IJCRS 2017 Proceedings (2017), in press
18. Coiera, E.: Looking for evidence-based medical informatics. Recenti progressi in
medicina 107(3), 124–126 (2016)
19. Darcy, A.M., Louie, A.K., Roberts, L.W.: Machine learning and the profession of
medicine. JAMA 315(6), 551–552 (2016)
20. Denœux, T., Kanjanatarakul, O.: Evidential Clustering: A Review, pp. 24–35
(2016)
21. Deo, R.C.: Machine learning in medicine. Circulation 132(20), 1920–1930 (2015)
22. Dharmarajan, K., Strait, K.M., Tinetti, M.E., Lagu, T., Lindenauer, P.K., Lynn,
J., Krukas, M.R., Ernst, F.R., Li, S.X., Krumholz, H.M.: Treatment for multiple
acute cardiopulmonary conditions in older adults hospitalized with pneumonia,
chronic obstructive pulmonary disease, or heart failure. Journal of the American
Geriatrics Society 64(8), 1574–1582 (2016)
23. Centers for Disease Control and Prevention: ICD-9-CM official guidelines for coding and reporting. Tech. rep., Centers for Medicare & Medicaid Services, Atlanta, GA, USA (2011)
24. Domingos, P.: A few useful things to know about machine learning. Communica-
tions of the ACM 55(10), 78–87 (2012)
25. Elliott, J.H., Grimshaw, J., Altman, R., Bero, L., Goodman, S.N., Henry, D.,
Macleod, M., Tovey, D., Tugwell, P., White, H., et al.: Informatics: Make sense
of health data. Nature 527, 31–32 (2015)
26. Esteva, A., Kuprel, B., Novoa, R.A., Ko, J., Swetter, S.M., Blau, H.M., Thrun, S.:
Dermatologist-level classification of skin cancer with deep neural networks. Nature
542(7639), 115–118 (2017)
27. Feldman, K., Faust, L., Wu, X., Huang, C., Chawla, N.V.: Beyond volume: The
impact of complex healthcare data on the machine learning pipeline. arXiv preprint
arXiv:1706.01513 (2017)
28. Fox, R.C.: Medical uncertainty revisited. Handbook of social studies in health and
medicine pp. 409–425 (2000)
29. Frénay, B., Verleysen, M.: Classification in the presence of label noise: a survey.
IEEE transactions on neural networks and learning systems 25(5), 845–869 (2014)
30. Gillies, A.: Viewpoint: Embracing uncertainty. Br J Gen Pract 67(658), 215–215
(2017)
31. Goddard, K., Roudsari, A., Wyatt, J.C.: Automation bias: a systematic review
of frequency, effect mediators, and mitigators. Journal of the American Medical
Informatics Association 19(1), 121–127 (2012)
32. Gordon, D.F., Desjardins, M.: Evaluation and selection of biases in machine learn-
ing. Machine learning 20(1), 5–22 (1995)
33. Graham, B.: The diagnosis and treatment of carpal tunnel syndrome:
Surgery—whether open or closed—works, but only if the diagnosis is right. BMJ:
British Medical Journal 332(7556), 1463 (2006)
34. Greenhalgh, T.: Uncertainty and clinical method. In: Clinical uncertainty in pri-
mary care, pp. 23–45. Springer (2013)
35. Grzymala-Busse, J.W.: A comparison of some rough set approaches to mining
symbolic data with missing attribute values. In: Kryszkiewicz, M., Rybinski, H.,
Skowron, A., Ras, Z.W. (eds.) Foundations of Intelligent Systems - 19th Interna-
tional Symposium, ISMIS 2011, Warsaw, Poland, June 28-30, 2011. Proceedings.
Lecture Notes in Computer Science, vol. 6804, pp. 52–61. Springer (2011)
36. Grzymala-Busse, J.W., Grzymala-Busse, W.J.: Handling missing attribute values.
In: Maimon, O., Rokach, L. (eds.) Data Mining and Knowledge Discovery Hand-
book, pp. 33–51. Springer US, Boston, MA (2010)
37. Gulshan, V., Peng, L., Coram, M., Stumpe, M.C., Wu, D., Narayanaswamy, A.,
Venugopalan, S., Widner, K., Madams, T., Cuadros, J., et al.: Development and
validation of a deep learning algorithm for detection of diabetic retinopathy in
retinal fundus photographs. JAMA 316(22), 2402–2410 (2016)
38. Gwet, K.: Handbook of inter-rater reliability. STATAXIS Publishing Company
(2001)
39. Hajian, S., Bonchi, F., Castillo, C.: Algorithmic bias: From discrimination discovery
to fairness-aware data mining. In: Proceedings of the 22nd ACM SIGKDD Interna-
tional Conference on Knowledge Discovery and Data Mining. pp. 2125–2126. ACM
(2016)
40. Haouari, B., Amor, N.B., Elouedi, Z., Mellouli, K.: Naïve possibilistic network
classifiers. Fuzzy Sets and Systems 160(22), 3224–3238 (2009)
41. Hatch, S.: Uncertainty in medicine (2017)
42. Hayes, S.: Terminal digit preference occurs in pathology reporting irrespective of
patient management implication. Journal of clinical pathology 61(9), 1071–1072
(2008)
43. Hüllermeier, E.: Possibilistic instance-based learning. Artif. Intell. 148(1-2), 335–
383 (2003)
44. Hüllermeier, E.: Fuzzy sets in machine learning and data mining. Appl. Soft Com-
put. 11(2), 1493–1505 (2011)
45. Hüllermeier, E.: Does machine learning need fuzzy logic? Fuzzy Sets and Systems
281, 292–299 (2015)
46. Indrayan, A., Holt, M.: Concise Encyclopedia of Biostatistics for Medical Profes-
sionals. CRC Press (2017)
47. Jewett, M., Bombardier, C., Caron, D., Ryan, M., Gray, R., St Louis, E., Witchell,
S., Kumra, S., Psihramis, K.: Potential for inter-observer and intra-observer vari-
ability in x-ray review to establish stone-free rates after lithotripsy. The Journal
of urology 147(3), 559–562 (1992)
48. Jha, S., Topol, E.J.: Adapting to artificial intelligence: radiologists and pathologists
as information specialists. JAMA 316(22), 2353–2354 (2016)
49. Katz, J.: The silent world of doctor and patient. JHU Press (2002)
50. Kooi, T., Litjens, G.J.S., van Ginneken, B., Gubern-Mérida, A., Sánchez, C.I.,
Mann, R., den Heeten, A., Karssemeijer, N.: Large scale deep learning for computer
aided detection of mammographic lesions. Medical Image Analysis 35, 303–312
(2017)
51. Krippendorff, K.: Content analysis: An introduction to its methodology. Sage
(2012)
52. Landis, J.R., Koch, G.G.: The measurement of observer agreement for categorical
data. Biometrics 33(1), 159–174 (1977)
53. Leachman, S.A., Merlino, G.: Medicine: The final frontier in cancer diagnosis. Na-
ture 542(7639), 36–38 (2017)
54. van der Lei, J., et al.: Use and abuse of computer-stored medical records. Methods
Archive 30, 79–80 (1991)
55. Li, X., Liu, H., Du, X., Zhang, P., Hu, G., Xie, G., Guo, S., Xu, M., Xie, X.:
Integrated machine learning approaches for predicting ischemic stroke and throm-
boembolism in atrial fibrillation. In: AMIA Annual Symposium Proceedings. vol.
2016, p. 799. American Medical Informatics Association (2016)
56. Lipton, Z.C.: The mythos of model interpretability. arXiv preprint
arXiv:1606.03490 (2016)
57. Lodwick, W.A.: Fundamentals of Interval Analysis and Linkages to Fuzzy Set The-
ory, pp. 55–79. John Wiley & Sons, Ltd (2008)
58. Lok, C.E., Morgan, C.D., Ranganathan, N.: The accuracy and interobserver agree-
ment in detecting the ‘gallop sounds’ by cardiac auscultation. Chest 114(5), 1283–
1288 (1998)
59. Maravalle, M., Ricca, F., Simeone, B., Spinelli, V.: Carpal tunnel syndrome auto-
matic classification: electromyography vs. ultrasound imaging. TOP 23(1), 100–123
(2015)
60. Mitchell, T.M.: The need for biases in learning generalizations. Department of
Computer Science, Laboratory for Computer Science Research, Rutgers Univ. New
Jersey (1980)
61. Mitchell, T.M.: Machine Learning. McGraw-Hill, Burr Ridge, IL (1997)
62. de Mul, M., Berg, M.: Completeness of medical records in emergency trauma care
and an IT-based strategy for improvement. Medical informatics and the Internet in
medicine 32(2), 157–167 (2007)
63. Obermeyer, Z., Emanuel, E.J.: Predicting the future—big data, machine learning,
and clinical medicine. The New England journal of medicine 375(13), 1216 (2016)
64. O’Farrell, P.: What is gallop rhythm? Irish Journal of Medical Science (1926-1967)
14(10), 729–739 (1939)
65. Parasuraman, R., Manzey, D.H.: Complacency and bias in human use of automa-
tion: An attentional integration. Human Factors: The Journal of the Human Fac-
tors and Ergonomics Society 52(3), 381–410 (2010)
66. Parsons, S.: Qualitative approaches for reasoning under uncertainty. The MIT
Press, Cambridge, Massachusetts (2001)
67. Patel, V.L., Shortliffe, E.H., Stefanelli, M., Szolovits, P., Berthold, M.R., Bellazzi,
R., Abu-Hanna, A.: The coming of age of artificial intelligence in medicine. Artifi-
cial intelligence in medicine 46(1), 5–17 (2009)
68. Paxton, C., Niculescu-Mizil, A., Saria, S.: Developing predictive models using elec-
tronic medical records: challenges and pitfalls. In: AMIA Annual Symposium Pro-
ceedings. vol. 2013, p. 1109. American Medical Informatics Association (2013)
69. Pivert, O., Prade, H.: A certainty-based model for uncertain databases. IEEE
Trans. Fuzzy Systems 23(4), 1181–1196 (2015)
70. Moore, R.E., Kearfott, R.B., Cloud, M.J.: Introduction to Interval Analysis. SIAM (2009)
71. Reiser, S.J.: The clinical record in medicine part 2: Reforming content and purpose.
Annals of internal medicine 114(11), 980–985 (1991)
72. Reiser, S.J.: Medicine and the Reign of Technology. Cambridge University Press
(1981)
73. Reiser, S.J., Anbar, M.: The machine at the bedside: Strategies for using technology
in patient care. Cambridge University Press (1984)
74. Rosenfeld, R.M.: Uncertainty-based medicine. Otolaryngology–Head and Neck
Surgery 128(1), 5–7 (2003)
75. Ruamviboonsuk, P., Teerasuwanajak, K., Tiensuwan, M., Yuttitham, K., Thai Screening for Diabetic Retinopathy Study Group: Interobserver agreement in the interpretation of single-field digital fundus images for diabetic retinopathy screening. Ophthalmology 113(5), 826–832 (2006)
76. Schwartz, W.B., Patil, R.S., Szolovits, P.: Artificial Intelligence in Medicine. New
England Journal of Medicine 316(11), 685–688 (Mar 1987)
77. Shafiq, A., Arnold, S.V., Gosch, K., Kureshi, F., Breeding, T., Jones, P.G., Bel-
trame, J., Spertus, J.A.: Patient and physician discordance in reporting symptoms
of angina among stable coronary artery disease patients: Insights from the angina
prevalence and provider evaluation of angina relief (appear) study. American heart
journal 175, 94–100 (2016)
78. Shortliffe, E.H., Buchanan, B.G.: A model of inexact reasoning in medicine. Math-
ematical biosciences 23(3-4), 351–379 (1975)
79. Simpkin, A.L., Schwartzstein, R.M.: Tolerating uncertainty—the next medical rev-
olution? New England Journal of Medicine 375(18), 1713–1715 (2016)
80. Spodick, D.H., Bishop, R.L.: Computer treason: intraobserver variability of an
electrocardiographic computer system. The American journal of cardiology 80(1),
102–103 (1997)
81. Stetson, P.D., Bakken, S., Wrenn, J.O., Siegler, E.L., et al.: Assessing electronic
note quality using the physician documentation quality instrument (PDQI-9). Appl
Clin Inform 3(2), 164–174 (2012)
82. Svensson, C.M., Hubler, R., Figge, M.T.: Automated classification of circulating
tumor cells and the impact of interobserver variability on classifier training and
performance. Journal of immunology research 2015 (2015)
83. Throckmorton, T.W., Dunn, W., Holmes, T., Kuhn, J.E.: Intraobserver and inter-
observer agreement of international classification of diseases, ninth revision codes
in classifying shoulder instability. Journal of shoulder and elbow surgery 18(2),
199–203 (2009)
84. Timmermans, S., Berg, M.: The gold standard: The challenge of evidence-based
medicine and standardization in health care. Temple University Press (2010)
85. Tsumoto, S.: Medical diagnosis: Rough set view. In: Thriving Rough Sets, pp.
139–156. Springer (2017)
86. Van Driest, S.L., Wells, Q.S., Stallings, S., Bush, W.S., Gordon, A., Nickerson,
D.A., Kim, J.H., Crosslin, D.R., Jarvik, G.P., Carrell, D.S., et al.: Association
of arrhythmia-related genetic variants with phenotypes documented in electronic
medical records. JAMA 315(1), 47–57 (2016)
87. Veress, B., Gadaleanu, V., Nennesmo, I., Wikström, B.: The reliability of autopsy
diagnostics: inter-observer variation between pathologists, a preliminary report.
International Journal for Quality in Health Care 5(4), 333–337 (1993)
88. Wang, X., Zhai, J.: Learning with uncertainty. CRC Press (2017)
89. Wang, Y.T., Tadarati, M., Wolfson, Y., Bressler, S.B., Bressler, N.M.: Comparison
of prevalence of diabetic macular edema based on monocular fundus photography
vs optical coherence tomography. JAMA ophthalmology 134(2), 222–228 (2016)
90. Weng, S.F., Reps, J., Kai, J., Garibaldi, J.M., Qureshi, N.: Can machine-learning
improve cardiovascular risk prediction using routine clinical data? PloS one 12(4),
e0174944 (2017)
91. Wiens, J., Wallace, B.C.: Editorial: special issue on machine learning for health
and medicine. Machine Learning 102(3), 305–307 (2016)
92. Wong, T.Y., Bressler, N.M.: Artificial intelligence with deep learning technology
looks into diabetic retinopathy screening. JAMA 316(22), 2366–2367 (2016)
93. Wu, H., Deng, Z., Zhang, B., Liu, Q., Chen, J.: Classifier model based on machine
learning algorithms: Application to differential diagnosis of suspicious thyroid nod-
ules via sonography. American Journal of Roentgenology 207(4), 859–864 (2016)
94. Yao, Y.: Rough sets and three-way decisions. In: Ciucci, D., Wang, G., Mitra, S.,
Wu, W. (eds.) Rough Sets and Knowledge Technology - 10th International Con-
ference, RSKT 2015, held as part of the International Joint Conference on Rough
Sets, IJCRS 2015, Tianjin, China, November 20-23, 2015, Proceedings. Lecture
Notes in Computer Science, vol. 9436, pp. 62–73. Springer (2015)
95. Yao, Y.: Three-way decisions and cognitive computing. Cognitive Computation
8(4), 543–554 (2016)