Harnessing Biomedical Literature to
Calibrate Clinicians’ Trust in AI Decision Support Systems
Qian Yang
Cornell University
Ithaca, NY, USA
qianyang@cornell.edu
Yuexing Hao∗
Cornell University
Ithaca, NY, USA
yh727@cornell.edu
Kexin Quan∗
University of California,
San Diego
San Diego, CA, USA
kquan@ucsd.edu
Stephen Yang∗
Cornell University
New York City, NY, USA
sy364@cornell.edu
Yiran Zhao∗
Cornell Tech
New York City, NY, USA
yz2647@cornell.edu
Volodymyr Kuleshov
Cornell Tech
New York City, NY, USA
kuleshov@cornell.edu
Fei Wang
Weill Cornell Medicine
New York City, NY, USA
few2001@med.cornell.edu
ABSTRACT
Clinical decision support tools (DSTs), powered by Artificial Intelligence (AI), promise to improve clinicians' diagnostic and treatment decision-making. However, no AI model is always correct. DSTs must enable clinicians to validate each AI suggestion, convincing them to take the correct suggestions while rejecting the AI's errors. While prior work often tried to do so by explaining AI's inner workings or performance, we chose a different approach: We investigated how clinicians validated each other's suggestions in practice (often by referencing scientific literature) and designed a new DST that embraces these naturalistic interactions. This design uses GPT-3 to draw literature evidence that shows the AI suggestions' robustness and applicability (or the lack thereof). A prototyping study with clinicians from three disease areas proved this approach promising. Clinicians' interactions with the prototype also revealed new design and research opportunities around (1) harnessing the complementary strengths of literature-based and predictive decision supports; (2) mitigating risks of de-skilling clinicians; and (3) offering low-data decision support with literature.
CCS CONCEPTS
• Information systems → Information extraction; • Applied computing → Health care information systems; • Human-centered computing → Empirical studies in HCI.
KEYWORDS
Clinical AI, XAI, Biomedical Literature, Qualitative Method
ACM Reference Format:
Qian Yang, Yuexing Hao, Kexin Quan, Stephen Yang, Yiran Zhao, Volodymyr Kuleshov, and Fei Wang. 2023. Harnessing Biomedical Literature to Calibrate Clinicians' Trust in AI Decision Support Systems. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems (CHI '23), April 23–28, 2023, Hamburg, Germany. ACM, New York, NY, USA, 14 pages. https://doi.org/10.1145/3544548.3581393

∗Equal contribution.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
CHI '23, April 23–28, 2023, Hamburg, Germany
© 2023 Copyright held by the owner/author(s). Publication rights licensed to ACM.
ACM ISBN 978-1-4503-9421-5/23/04...$15.00
https://doi.org/10.1145/3544548.3581393
1 INTRODUCTION
The idea of leveraging machine intelligence to improve clinical decision-making has fascinated healthcare and Artificial Intelligence (AI) researchers for decades. Today, diverse AI systems have proven their performance in research labs and are moving into clinical practice in the form of Decision Support Systems (DSTs). From Bayesian models that predict treatment outcomes based on Electronic Health Records (EHR) [67] to computer vision systems that interpret medical images [39], from rule-based systems that alert drug interactions [10] to wearable-sensing AIs that monitor disease progression [18, 56], AI-powered DSTs promise to reduce clinician decision errors and improve patient outcomes.
While all clinical AI models can offer valuable diagnostic or treatment suggestions, none is always correct. Therefore, clinicians must calibrate their trust in each AI suggestion on a case-by-case basis. DST interaction designs can help. Ideally, clinical DSTs can provide information that enables clinicians to adopt only the correct AI suggestions while staying unbiased by AI errors.
This is a challenging goal. Amongst other approaches, existing DST designs most often supported clinician-AI trust calibration by explaining how the AI generated its suggestions and how well it performed on past patient data [26, 35, 54, 73]. This "explanation" approach struggled in clinical practice: When the clinician's hypothesis was wrong and the AI advice was correct, the explanations rarely persuaded clinicians to take the advice [14, 26, 32, 36, 41, 74]. When the AI suggestion was wrong, explanations could also fail to help clinicians notice the error [21, 73, 75]. These failures propelled some researchers to design more persuasive AI explanations, for example, by presenting only the parts of the explanations that justify the AI's output [17]. However, these designs are emergent, and they have not yet empirically demonstrated their effectiveness.
This paper aims to identify new DST designs that can effectively calibrate clinicians' trust in AI suggestions on a case-by-case basis, enabling them to take only correct suggestions while rejecting the AI's errors. Instead of exploring new ways to explain the AI and testing them on clinicians afterward, we chose to start by investigating how
clinicians validated each other's diagnostic or treatment suggestions in practice. We hoped that these naturalistic interactions would reveal new insights into the kinds of information clinicians need to calibrate their trust in AI suggestions. Given that biomedical literature (e.g., reports of randomized controlled trials) is known to play a central role in this process [31] and that clinicians had rejected AI suggestions because "they have not been published in prestigious clinical journals" [25, 75], we paid particular attention to how clinicians sought and used evidence from the literature.
Through contextual inquiries with 12 clinicians and their assistants, we found that clinicians rarely explained how they came up with the suggestion when exchanging diagnostic or treatment suggestions. Instead, they sought all evidence that may validate or invalidate the suggestion from biomedical literature, their shared source of truth. They then examined these pieces of evidence based on the evidence's applicability to the specific patient situation at hand, thereby concluding whether to accept the suggestion.
Embracing these findings, we designed a novel form of DST that imitates clinicians' natural trust-calibration interactions. Rather than explaining how the AI generated its suggestion, the DST provides literature evidence that can potentially validate or invalidate the suggestions. In presenting the evidence, the DST highlights its applicability to the particular patient situation in question, rather than how rigorously it was concluded from past patient cases. A co-prototyping study with 9 clinicians confirmed these design strategies. Further, clinicians' interactions with the prototype surfaced additional design and research opportunities around designing more intuitive clinician-AI trust-calibration interactions and harnessing biomedical literature for creating such interactions.
This paper presents the initial contextual inquiries, the new DST design, and findings from the prototyping study. It makes three contributions. First, this work provides a rare description of how clinicians sought information to validate each other's diagnostic/treatment suggestions in their practice. It provides a timely answer to researchers' call for a use-context-focused approach to designing explainable AI [43]. Second, this work identifies an alternative approach to calibrating clinicians' trust in AI. Our prototype exemplifies one possibility in this new design space. Third, the prototyping study offers an initial description of how clinicians deliberate clinical decisions using a predictive DST and biomedical literature simultaneously. Both clinical DST and Biomedical Natural Language Processing (biomedNLP) technologies are rapidly maturing. This research offers a valuable reference for future research that harnesses both technologies to improve clinical decision-making.
2 RELATED WORK
2.1 Challenges of Validating AI Suggestions
with “Explainable AI”
Across the fields of HCI and AI, extensive research has studied how to calibrate clinicians' trust in AI. Much of this research can trace its origin to "explainable AI" literature that aims to make AI less like a "black box" [16, 42, 73]. As a result, when designing AI-powered DSTs, designers and researchers most often offered clinicians explanations of the AI's inner workings. The explanations include, for example, descriptions of training data, its machine learning model's training and prediction processes, and the model's performance indicators [53, 54, 66, 74]. Practitioner-facing tools (e.g., Microsoft's HAX "responsible AI" toolkit, data sheets [24], model cards [46]) further boosted this approach's real-world impact. Experiments showed that AI explanations improved clinicians' satisfaction with DSTs and increased the likelihood of them taking the AI's advice [9, 42, 51, 60].
However, because AI suggestions are not always correct, "increasing the likelihood of clinicians taking the AI's advice" did not mean that clinicians made better decisions. To make better decisions, clinicians must calibrate their trust in each AI suggestion individually, on a case-by-case basis. Existing AI explanations largely failed in this regard. In clinical practice, the AI explanations were either unable to persuade clinicians to take the AI's correct advice (under-trust) [14, 26, 32, 36, 41, 73–75], or failed to enable clinicians to reject AI's errors (over-trust) [21, 32].
Recent research has identified several causes of AI explanations' repeated failure.
• AI explanations are not always available. Some clinical AI systems (such as deep learning systems for medical imaging [32]) are uninterpretable even to AI researchers, much more so to clinicians.
• Too much information, too little time. In clinical practice, clinicians make dozens, if not hundreds, of decisions in a day [19, 27], while each AI informs a decision. It can seem unrealistic to expect clinicians to comprehend all these AIs' inner workings in addition to caring for patients [26, 34, 61, 70, 75].
• AI explanations are not intuitive or actionable to clinicians. In fast-paced clinical practice, clinicians need actionable information [13, 14]. Actionable information in the medical world typically means that the action has a causal relationship with desired patient outcomes, and ideally this causal relationship has been proven by randomized clinical trials on a large population [7, 30]. In contrast, AI models make personalized, data-driven predictions. The validity and actionability of AI suggestions, therefore, are counter-intuitive to many clinicians [75]. Studies showed that, even when clinicians understood how an AI works, they could not decide whether to take its suggestion. Instead, clinicians requested results from prospective randomized clinical trials that can show the causality between the AI features and patient outputs. They wanted these results to come from top medical journals [25, 34, 74, 75]; they wanted the information to be inherently trustworthy, rather than requiring them to validate it in the midst of their clinical decision-making [34]. For most clinical AI models, such evidence does not exist.
Recent research has started improving AI explanation designs to address these challenges. Some worked to make uninterpretable models more interpretable by appending purpose-built explanation-generation models to the DST [71], or offering counterfactual predictions demonstrating how changes to the values of AI features may impact their outputs [8, 38, 72]. Others addressed the "too much information, too little time" problem by presenting only selective or modular explanations to clinicians, the explanations that best justify the AI's suggestion [3, 17] or can best address clinicians' potential biases [35, 73]. These approaches are nascent and have not yet demonstrated an impact on clinicians' decision quality.
Interestingly, for a few other researchers, the challenges listed above suggest that calibrating clinician-AI trust by explaining the AI is a "false hope" entirely; clinicians need different kinds of information to validate AI suggestions [16, 25, 26]. These researchers argued that AI explanations' lack of actionability—one cannot decide whether an AI suggestion is right or wrong, even if one fully understands how the AI generated its suggestion—is a fundamental shortcoming that better explanation designs cannot fix. Instead of improving on the explanation designs, they promoted the idea that designers should take a step back and work to understand clinicians' information needs in validating AI suggestions, with considerations of their "identity, use contexts, social contexts, and environmental cues" [35, 43]. This call for action is a key motivator for this research.
2.2 How Clinicians Validate Clinical
Suggestions in Practice
To bootstrap our investigations into how clinicians validate care
suggestions in their daily practice, we reviewed the literature on
clinical hypothesis testing and decision-making. We noted three
themes in this work: evidence-based best practices, the interpretive
nature of many clinical decisions, and tool use.
Best practices: Evidence-Based Medicine (EBM). A culture of best practice dominates the medical world; therefore, medical textbooks and best practices can offer some indications of clinicians' actual practice. Known as Evidence-Based Medicine (EBM), clinical best practices require clinicians to use "clinical evidence" when examining patient-care-related suggestions and making decisions. "Clinical evidence" includes, for example, results of randomized controlled trials, standard care procedures concluded from multiple trials, the biological relationships between patients' genome profiles and their treatment outcomes, and expert opinions on noteworthy case studies [11, 29, 31, 57, 63]. Biomedical literature is central to EBM best practices because it documents all of this clinical evidence across medical domains [55, 65]. In medical schools, students learn to search for and use evidence in the literature to validate diagnostic or treatment hypotheses [30].
Medical textbooks instruct clinicians to, before adopting a piece of clinical evidence, examine both its rigor (How effectively did it establish the causal effect of the intervention on the patient outcome?) and its applicability (Does the study result apply to their particular patient?) [64]. In examining rigor, clinicians should prioritize results from large-scale cohort studies over those from published case studies, according to the best practice known as the Level of Evidence Pyramid [48]. In examining applicability, clinicians should identify evidence that fits their patient's population, the intervention under their consideration, the intervention comparator, and the patient outcome of their interest, known as the PICO framework.
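To make the PICO framing concrete, the sketch below shows one way a patient's situation and a published study could each be represented as a PICO frame and compared for applicability. This is an illustrative example only; the field names, descriptor sets, and the naive overlap score are our own assumptions, not part of the paper or of any clinical standard beyond the PICO acronym itself.

```python
# Illustrative sketch (not from the paper): a minimal PICO representation and a
# naive applicability check. Field names and the scoring rule are hypothetical.
from dataclasses import dataclass

@dataclass
class PICOFrame:
    population: set    # descriptors, e.g., {"female", "preschool-age", "liddle-syndrome"}
    intervention: str  # e.g., "genetic testing for monogenic hypertension"
    comparator: str    # e.g., "standard diagnostic workup"
    outcome: str       # e.g., "confirmed diagnosis"

def population_applicability(study: PICOFrame, patient: PICOFrame) -> float:
    """Fraction of the patient's population descriptors covered by the study."""
    if not patient.population:
        return 0.0
    return len(study.population & patient.population) / len(patient.population)

# Example: a trial on school-aged children with Liddle syndrome only partially
# matches a preschool-aged patient, so its applicability score is below 1.0.
patient = PICOFrame({"female", "preschool-age", "liddle-syndrome"},
                    "genetic testing", "standard workup", "confirmed diagnosis")
study = PICOFrame({"female", "school-age", "liddle-syndrome"},
                  "genetic testing", "standard workup", "confirmed diagnosis")
print(population_applicability(study, patient))  # 2 of 3 descriptors match -> 0.666...
```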
Empirical research showed that clinicians indeed use literature for decision-making in practice [20, 62]. When clinicians disagreed on a diagnosis or treatment, they often used literature as a shared source of truth [52, 59, 75]. The ways in which clinicians reacted to some AI explanations (e.g., requesting evidence of causality, see section 2.1) can seem reflective of these best practices.
Literature-based decision-support tools. When validating patient-care-related hypotheses or suggestions, clinicians most often used literature-based tools. These tools reflect both the EBM best practices and their need to rapidly mine seas of literature. For example, clinicians across disease areas use PubMed, a biomedical literature search engine [20, 49], which allows clinicians to search for clinical evidence based on its level of rigor (e.g., randomized controlled trials, controlled trials, cohort studies, etc.). Many also used manually curated literature digest apps such as UpToDate [22] and Journal Club [1]. In addition, some hospitals hire in-house clinical librarians to assist clinicians in the literature search for decision-making [23, 37].
Recently, (semi-)automated literature digest tools have emerged,
thanks to the rapid advances in biomedical Natural Language Pro-
cessing and pre-trained language models such as GPT-3 and BERT [
40
].
For example, healthcare researchers have started building literature
tools that search for clinical trials that match a patient situation
based on the PICO framework [
50
,
58
]. Such tools are available
across multiple disease areas, such as cardiology, psychiatry, and
COVID-19.
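As an illustration of what such PICO-driven retrieval can look like, the sketch below assembles a PICO-shaped boolean query and sends it to PubMed's public E-utilities esearch endpoint. It is a minimal, assumption-laden example of the general idea, not the tooling described in [50, 58]; the query construction and filters are our own.

```python
# Hedged sketch: a PICO-shaped PubMed query via the public NCBI E-utilities API.
# The endpoint is the standard esearch service; the query wording is illustrative.
import requests

ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def search_trials(population: str, intervention: str, outcome: str, retmax: int = 20):
    """Return PubMed IDs of randomized trials roughly matching a PICO question."""
    term = (
        f"({population}) AND ({intervention}) AND ({outcome}) "
        "AND randomized controlled trial[Publication Type]"
    )
    resp = requests.get(
        ESEARCH_URL,
        params={"db": "pubmed", "term": term, "retmode": "json", "retmax": retmax},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["esearchresult"]["idlist"]  # list of matching PubMed IDs

# Hypothetical PICO-style question for a pediatric hypertension treatment decision.
pmids = search_trials("children with monogenic hypertension",
                      "amiloride", "blood pressure control")
```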
Social decision support and the interpretive nature of real-world clinical decisions. Notably, the situated and interpretative nature of many clinical decisions requires knowledge beyond what's proven effective in past patient cases or clinical trials [61]. For example, for patients who arrived in Emergency Rooms crashing and dying, clinicians often had to make decisions despite incomplete information, not to mention experimentally-proven evidence. In end-of-life treatment decision-making, the patient's personal values are also often weighed vis-à-vis scientific evidence of treatment outcomes [76]. In these decision contexts, clinicians resisted data-driven decision "evidence" provided by AI and preferred "social decision support" from colleagues: consulting each other and drawing from others' tacit knowledge and experience [28, 75].
However, little empirical research has investigated how clinicians validated decision hypotheses (through literature, phone consultation, or other means) and made care decisions as they naturally occur. Instead, most work simply noted whether clinicians did or did not deviate from the EBM best practice [15, 33]. This line of research identified clinicians' lack of time as one of the most significant barriers to practicing EBM at the point of care [15, 33].
3 METHOD
We wanted to explore new DST designs that can help clinicians validate an AI's diagnoses and treatment suggestions on a case-by-case basis, convincing clinicians to take the AI's correct advice while rejecting its errors. Prior research has created many thoughtful DST designs by explaining how the AI worked. However, such explanations can seem vastly different from what clinical best practices instruct clinicians to seek when deliberating a care decision. Our work attempts to bring these strands of related work together. We first worked to understand how clinicians validated each other's care suggestions in current practices (study 1). Building upon this empirical understanding, we then investigated how similar approaches might help clinicians validate AI's advice (study 2).
Study 1 | Clinical Domain | Experience || Study 2 | Clinical Domain | Experience
P1 | Neurology | 10-30 yrs || P13 | Internal Medicine | Over 30 yrs
P2 | Cardiology | 10-30 yrs || P14 | Medical Student | 2-5 yrs
P3 | Pediatric Neurology | 10-30 yrs || P15 | Psychiatry | 2-5 yrs
P4 | Cardiology | 10-30 yrs || P16 | Psychiatry | 1-2 yrs
P5 | Neurogenetics | 10-30 yrs || P17 | Medical Informatics | 1-2 yrs
P6 | Nephrology | 10-30 yrs || P18 | Pediatric ER | Over 30 yrs
P7 | Pathology | 5-10 yrs || P19 | Pharmacology | 5-10 yrs
P8 | Pediatric Nephrology | 10-30 yrs || P20 | Nursing | 1-2 yrs
P9 | Clinical Librarian | 10-30 yrs || P21 | Family Medicine | 10-30 yrs
P10 | ER | 2-5 yrs ||
P11 | Clinical Librarian | 10-30 yrs ||
P12 | Clinical Librarian | 5-10 yrs ||
Table 1: Study participants.
3.1 Stage 1: Investigating Natural Interactions
The first study aimed to understand how clinicians examined each other's clinical suggestions in practice. Because biomedical literature is known to play a central role both in clinicians' everyday decision-making and in their rationale for resisting AI suggestions [7, 30], we chose to focus on how clinicians sought, prioritized, and synthesized information from the literature to validate care suggestions. We paid particular attention to whether and how clinicians' behaviors deviated from best practices under the time pressure of patient care.
We conducted IRB-approved semi-structured interviews (with components of contextual inquiry) with 9 clinicians and 3 clinical librarians whom these clinicians routinely hired to help with decision-making. We intentionally recruited participants from different clinical roles and specialties, including nephrology physicians (P1, P6), pediatric physicians (P3, P8), cardiology physicians (P2, P4), a pediatric geneticist (P5), an emergency medicine medical intern (P10), and a pathologist (P7). We recruited initial participants from our collaborating hospitals and then expanded the set through snowball sampling. We conducted the interviews remotely. Each interview lasted about 60 minutes.
In each interview, we started by inviting the participant to recall in detail a recent experience where they needed to examine a hypothesis or suggestion regarding patient care. We asked them to describe the specific patient situation and the broader clinical contexts. If and when they started describing their literature search, we invited them to share their screen and (re-)perform their search using the tools they used in actual practice (e.g., PubMed, Google Scholar). We invited them to think aloud, detailing how they sought, read, interpreted, and synthesized various information (within and beyond the literature) in relation to the specific patient decision and hypothesis in question. We asked follow-up questions to better understand the motivations that drove their information needs and advice-taking or -rejection behaviors.
We screen/video recorded and transcribed all interviews. We analyzed the data using a combination of affinity diagrams [47], service blueprinting, and axial coding [5]. Through affinity diagramming, we analyzed (1) what information clinicians sought from the literature, (2) how they synthesized the information for the situated and interpretative patient situation at hand, and (3) how they prioritized such information when under time pressure. Further, service blueprinting allowed us to trace these information flows across the patient situation in question, clinicians' literature search, and clinicians' multiple decisions over the course of a patient's care. Next, we performed axial coding [12] to consolidate the perspectives of the clinicians who gave and received clinical suggestions. Finally, we confirmed our findings with four additional clinical professionals (a professor of biomedical informatics, a professor of medicine, and two orthopedic surgeons).
3.2 Stage 2: Imitating Natural Interactions
After understanding how clinicians naturally validated each other’s
suggestions, we aimed to examine whether similar interactions
can help clinicians validate AI suggestions. Towards this goal, we
designed and prototyped a new form of DST that imitates clinicians’
naturalistic interactions observed in study one. We then used the
prototype as a Design Probe [44, 45], conducting IRB-approved interviews with clinicians for feedback and also iteratively improving the design based on their feedback.
We prototyped three versions of the DST design, each focusing on one disease area (neurology, psychiatry, and palliative care). To create a realistic user experience (UX), we populated the prototypes with three retrospective patient cases from top medical journals. We worked with two clinicians in selecting the cases and removed privacy-sensitive information from the clinical narratives. Finally, we populated the DST prototypes with AI diagnosis or treatment suggestions for these patients, using previously published, open-source ML models [68, 69].
To elicit honest feedback on the design, we recruited an additional 9 clinicians who had not been interviewed earlier. We used the same recruitment process as in stage one. Study 2 participants came from internal medicine, psychiatry, family medicine, and emergency care, among other domains.
Decision Suggestions under Deliberation | Evidence Embedded in the Literature
Decide what information to inquire from a patient and actions to take during a patient visit | Latest standard care procedures for particular symptoms; etiology
Interpret lab test results and make a diagnosis | The sensitivity and effectiveness of the lab test; differential diagnosis
Make prognosis and treatment decisions | Different treatment options' anticipated outcomes based on outcomes of past patients similar to the patient situation in question (in terms of, for example, comorbidity)
Table 2: Clinicians share various literature evidence when they suggest different clinical actions to colleagues. The left column lists the clinical actions. The right column lists the information in the literature clinicians considered as "evidence" for that action.
In each interview, we first described the need for clinicians to deliberate whether to accept an AI-generated care suggestion. Without showing our prototype, we probed the participants' reactions to the idea of validating AI suggestions with literature. These reactions reflect clinicians' unbiased opinions regarding validating AI suggestions with literature. Next, we showed the participants a prototype that matched their medical expertise. We invited the participants to think aloud while they read the patient information, DST suggestions, and literature evidence. We asked follow-up questions to investigate how the literature evidence helped (or failed to help) them deliberate whether to accept the AI suggestions. Finally, we collected additional feedback by inviting clinicians to improve the prototype. We audio- and video-recorded the interviews and analyzed the data using the same methods as Stage 1 (see 3.1).
4 STAGE ONE FINDINGS: HOW CLINICIANS
NATURALLY VALIDATE SUGGESTIONS
In study one, our interviews confirmed that clinicians use biomedical literature to validate colleagues' care suggestions and to answer their "tough questions", especially around rare patient cases. The literature is clinicians' shared source of truth. While all clinicians we interviewed said that they were "very very busy" and did not have time to read, almost all had looked up literature within the past two weeks, some even within the past few days.
Below we describe findings around the kinds of information clinicians used to deliberate care suggestions: (1) Clinicians provide evidence that can validate (or invalidate) their suggested diagnosis or treatment, rather than explanations of the suggestion; (2) Clinicians consider "perfect evidence" to be evidence that fits the patient situation at hand "perfectly". Such evidence rarely exists; (3) When the "perfect evidence" does not exist, clinicians curate and synthesize imperfect evidence to make decisions; and (4) Clinicians utilize time-saving strategies to perform efficient reading, curation, and synthesis.
4.1 Don’t Explain. Provide Evidence Instead.
When exchanging clinical suggestions, clinicians rarely explained how they came up with the suggestions (e.g., a suggested diagnosis, a lab test order, or a medication order). Instead, they shared a list of "evidence" from biomedical literature. Table 2 summarizes the clinical suggestions clinicians have referenced literature for and the information clinicians considered as "evidence" of the suggestions.
All clinicians we interviewed made an explicit effort to seek and share a list of evidence that is scientific. The resulting list is scientific in two ways. Firstly, it is comprehensive. Clinicians worked to ensure that they provided all supporting and opposing evidence of their suggestion in the literature. For example, when suggesting a treatment via email, one clinician shared "the entire (literature) search results" to demonstrate that the evidence list is "unbiased by ranking algorithms" (P9). Others highlighted the latest literature on the list, signaling that they included "all the latest data" and treatment options (P1, P2). Secondly, the evidence list is reproducible. Clinicians often shared the link they used for their evidence search such that their colleagues can "get the same results each time" when examining the care suggestion and its list of evidence.
The comprehensive and reproducible evidence list served dual purposes in advice-giving-and-taking among clinicians. The list is persuasive. It demonstrated that the "best evidence available" supports the clinician's suggestion. On the other hand, the list allowed the clinician's colleagues to scrutinize the supporting and opposing evidence first-hand and draw their own conclusions.
4.2 The Myth of the Perfect Evidence
We use the term "perfect evidence" to refer to the kind of information that can immediately trigger clinicians' acceptance or rejection of a piece of clinical advice. In alignment with medical textbooks, clinicians characterized the perfect evidence as both rigorously generated (robustness) and highly applicable to the patient situation they face (applicability). However, interestingly, their criteria for evidence applicability far exceed what textbooks require.
When validating the rigor of a piece of literature evidence, clinicians used several proxies, such as the size of the patient population it has been tested on, publication venues, and authors. Some also checked the disclaimer part of the literature, assessing whether the clinical evidence on branded treatments was published for profit rather than in the patient's best interest (P2). Overall, the ways in which clinicians validated evidence rigor align with clinical best practices, though much faster. "It's the reviewers' job, not mine, to invest in the time to see if the study is legit." (P8)
In validating the applicability of a piece of literature evidence, however, clinicians spent ample time and much cognitive energy. When reading clinical trial reports and case studies, almost all clinicians we interviewed jumped right to the first table in the article, which describes the demographic information of the patient or
patient population. They then worked on mapping the patient pop-
ulation onto their patients’ situation. “[I] looked at what happened
to these patients under what particular circumstances in the past,
and sort of deciding whether they could apply to my patient.” (P6)
Some did this mentally. Others, especially the clinical librarians
who routinely handle unusually complex patient cases, used the
PICO (Population, Intervention, Comparator, Outcome) framework
to match patient situations explicitly.
A pursuit of literature evidence that matches their patient's situation "perfectly" drives clinicians' search. Their definition of perfect applicability far exceeds what best practices demand (e.g., a PICO match). To all clinicians we interviewed, perfection means that the patient population in the clinical trial matches not only their patient's demographics but "all characteristics of the patient" (P4), including age, sex, weight, presence of comorbidities, past disease trajectories, previous interventions, and even family history. P8, a pediatric nephrologist, provided a particularly vivid example. When treating a preschool-aged female patient with Liddle syndrome and a family history of aldosterone-mediated hypertension, P8 suspected that the patient might have monogenic hypertension (a type of genetic high blood pressure). He had run preliminary laboratory tests and looked up literature to decide what other tests he should order to confirm the diagnosis. He was looking for a testing procedure that had been proven effective on patients of the same gender and age, with the same disease variation (Liddle syndrome and monogenic hypertension), family history, and genetic inheritance pattern (e.g., the disease was similarly symptomatic in some family members). The evaluation of the suggested testing procedure "had to fit within those parameters" to convince him to carry out the suggestion immediately.
The perfect evidence does not exist for a vast majority of pa-
tient cases, especially for treatment suggestions. Combining the
various criteria clinicians outlined above, the perfect evidence for
the treatment of a patient is the treatment evaluation result from a
“large-scale, well-conducted” randomized controlled trial, in which
the trial participants are similar to the patient at hand in all charac-
teristics. This is an exceptionally high bar, if at all achievable. Fur-
ther, clinicians only seek help on unusual patient situations or rare
diseases, making the perfect clinician-evidence-patient-situation
match even less likely.
4.3 Making Decisions with Imperfect Evidence
Clinicians acknowledged that perfect evidence that can immediately trigger action is rare. After confirming that the perfect evidence does not exist for their patient, clinicians collected pieces of imperfect evidence: evidence that is useful but "non-conclusive" (P6). "You want to look for something else also." (P10)
Clinicians undertook a three-step process to identify and synthesize the list of imperfect evidence to conclude a care decision. First, clinicians collected a set of evidence from less "heavy hitting" journals. For example, they looked for insights from cohort studies, case-controlled studies, and case reports (rare patient cases that other clinicians encountered and published in peer-reviewed journals) (P4, P6, P10). When necessary, clinicians even referenced what they described as "obscure" (P3) or "gray literature" (P12), such as posters and conference meeting abstracts.
Next, if even less prestigious literature failed to reveal perfectly applicable evidence, clinicians relaxed their evidence applicability criteria. Using the PICO framework as a scaffold, clinicians first looked for studies or historical cases whose patient population and study setting overlap most with the patient situation they face. They referred to such evidence as "second-best matches". They then relaxed the search criteria further (for example, by including studies that cover a slightly different age group or a different subcategory of the same disease) until they had exhausted applicable evidence.
Finally, clinicians worked to identify a "trend" or "narrative" that can connect the pieces of evidence they collected from diverse sources with different levels of applicability. They again used the PICO framework as a scaffold. For example, to conclude a treatment decision, clinicians first "guesstimated" the likely treatment outcome of a patient population that shares some characteristics with their patient. They weighed competing evidence about the same population differently according to its robustness. Next, clinicians repeated this synthesis process on a patient population that shares another set of characteristics with their patient. This iterative process continues until they see a trend in the likely treatment outcome when these various patient characteristics meet. As P6 described, "It's essentially looking at that and sort of deciding what happened to those patients [...] if they could apply to my patient." Clinicians took action on their patient according to this evidence trend.
It is not a coincidence that, when clinicians failed to find the perfect evidence, all of them relaxed the robustness criteria to look for the "second-best" patient-situation matches. When synthesizing evidence, clinicians used the PICO patient-situation match as a scaffold. Overall, the clinicians prioritized applicability over robustness when seeking and synthesizing evidence. They considered a highly applicable case study more informative than a well-conducted, large-scale randomized trial on a different patient population. P3's recollection of a recent experience punctuates the lengths clinicians would go to in order to harness a piece of less-robust but highly applicable information. When exploring treatments for a rare disease in a patient, P3, a neurologist, came across an ongoing clinical trial that fits the patient's demographics and history. He went on to contact a friend who works at the same hospital as the author (the clinician who was leading the trial), got introduced to the author, sought their "tips and tricks on how we are going to be successful with this [disease management]", went on to contact the drug company, and finally got the drug under trial for the patient. Clinicians value evidence that fits their patient's situation well, in this case, even when the evidence is still under development.
“With Dr. [name], I wanted to know if he thought the
drug would work. And then, if it did, how do I get it?
How do I use it? What do I need to do? What resources
do I need? How we are going to be successful with this?”
4.4 Time-Saving Strategies
The interview findings so far have shown that, when deciding
whether to take a clinical suggestion, clinicians set an exceptionally
high standard for decision evidence and thoughtfully curated and
synthesized a comprehensive list of imperfect evidence when this
standard is too high to reach. Noteworthily, clinicians also carried
out a set of time-saving strategies to make this process as efficient
as possible.
Reading nothing but the evidence, its robustness, and its ap-
plicability. During the contextual inquiries, no clinicians read a
clinical trial or case study entirely. Instead, they skipped to the pop-
ulation description to assess the study’s applicability, then skipped
to the Method section to validate its robustness. If the study met
both criteria, clinicians skipped to results to find out whether the
result “was positive or not” (P6). Following this process, it takes
clinicians less than one minute to extract the evidence they need
from an article of dozens of pages.
“The only thing that matters to me is that: No survival
benefit with the addition of ADT (a medication). I won't
read into the p-values and the percentages. All I care
about is whether [outcome] is positive.” (P10)
Clinicians also deemed literature with neutral outcomes as ir-
relevant because the information is not actionable. They stopped
reading as soon as they realized the study outcome was neutral.
Synthesizing evidence (only) until it can justify action. While
clinicians’ literature synthesis for decision-making was compre-
hensive and systematic, the synthesis only went as far as it could
justify a clinical action. For example, if the suggested clinical action
is to order an additional lab test to confirm a potential diagnosis,
clinicians only need a relatively weak set of evidence to take the
suggestion because the cost of running the extra test is low, while
the cost of missing a diagnosis is high. In this sense, clinicians in-
form patient decision-making with literature evidence much faster
than a literature review process in research settings.
Reuse literature evidence over the course of a patient’s care. Be-
cause clinicians sought literature evidence that matched the patient
situation at hand, the same set of literature can inform multiple clin-
ical decisions over the course of the patient's care (Table 2). When first meeting the patients, clinicians accumulated "landmark papers" that can offer "a rough guideline" for care decisions but do not entirely match the patient's situation (P1, P6). As clinicians gathered more understanding about the patient situation (e.g., via diagnostic tests), they gradually turned to more precisely relevant clinical trials and even historic patient cases. The pieces of evidence they collected along the way progress from broad-stroke to case-specific, and all can potentially inform subsequent care decisions.
5 THE NEW DESIGN
We have so far described how clinicians examined each other's suggestions in current practice. Now we shift to consider how these naturalistic interactions inform how DSTs can help clinicians more effectively examine AI suggestions.
Recall that existing DSTs calibrated clinician-AI trust primarily by explaining to clinicians how the AI has generated its suggestions. Interestingly, clinicians used a very different type of information when deliberating on each other's clinical suggestions: biomedical literature. Specifically, we highlight four characteristics of literature information that made it effective.
(1) Providing evidence, rather than explanations. Clinicians deliberated on colleagues' care suggestions, not by how the colleague came up with the suggestion, but based on external evidence concluded from entirely different processes (e.g., clinical trials). Using this information, clinicians could stay focused on analyzing "whether this is the right diagnosis or treatment for this patient" rather than on "whether the suggestion is trustworthy" by itself.
This providing-supporting/opposing-evidence approach can potentially address several challenges AI explanations face in calibrating clinician trust in AI suggestions. First, external evidence that can support or oppose a suggested diagnosis/treatment is always available, regardless of whether the AI models are interpretable or not. Second, it can save clinicians' time because clinicians are likely to be much more efficient in analyzing "whether this is the right diagnosis or treatment for this patient" than in scrutinizing AI's inner workings. Finally, it can keep clinicians' cognitive work on clinical decision-making rather than on AI-explanation sense-making.
(2) Providing both evidence of robustness and of applicability, respectively. Prior research has shown that clinicians did not consider explanations of how an AI generated its suggestions as actionable. The explanations often failed to sway clinicians to accept or reject the AI's suggestion [36, 41]. But why? Our findings revealed that a combination of two types of information could trigger clinicians' advice-taking behavior: evidence of the suggestion's robustness (also known as "internal validity" [64]) and of its applicability to the patient situation at hand ("external validity"). However, the perfect evidence that indicates both rarely exists. Therefore, clinicians assessed these two types of evidence separately: a study's population size indicates its robustness; its patient population match indicates its applicability.
Existing DST designs did not always make this distinction. Instead, many DSTs unknowingly mixed evidence of the AI suggestion's robustness (e.g., evidence that the model is high performing on the patient population it was originally trained on) with evidence of its applicability (e.g., evidence that the model is high performing on a patient population that "perfectly" resembles the patient in question). Further, evaluating a model's performance on "patients like this patient in all characteristics" is rare and challenging. In this light, it can seem unsurprising that AI performance indicators also struggled to persuade clinicians to accept or reject AI suggestions.
(3) Providing comprehensive rather than selective evidence. A critical challenge in DST design is to calibrate clinicians' trust in AI under the time pressure of clinical practice. Some DST designs addressed this challenge by providing clinicians with selected information about the AI, such as a subset of features that can best justify the AI suggestion [26, 73]. Our interview findings suggest an alternative perspective to this approach. To clinicians, the trustworthiness of a clinical suggestion comes from the fact that the entirety of its supporting evidence, in aggregate, outweighs the entirety of its opposing evidence. Clinicians saved time by reading only the critical takeaway from every piece of evidence rather than reading particular pieces of evidence in full. DSTs might play a more critical role in calibrating clinician trust if they offered a comprehensive but succinct set of evidence and counter-evidence.
Figure 1: An initial prototype of the new DST design. This design helps clinicians deliberate whether to take an AI's suggestion by providing a list of supporting and opposing evidence from biomedical literature. The top panel highlights the patient's basic information (e.g., demographics). The columns on the left overview the patient's medical history and test results (in this example, psychiatric evaluation results). The AI's treatment suggestion, along with its supporting and opposing evidence from biomedical literature, is on the right.
(4)
Drawing the evidence from a shared source of truth. When
a clinician shared a suggestion with a colleague, they used
biomedical literature because it is a shared source of truth for
both the advice-giver and the advice-taker. As a result, the
clinicians did not need to elaborate on the trustworthiness of
the literature evidence itself; they were both aware, for example,
that randomized trials are more robust than historical case
studies. Instead, they used the literature as a shared language
and both focused on assessing the evidence’s applicability to
the specific patient situation at hand.
The same cannot be said about AI explanations. Explanations
around AI’s inner workings are the language of the AI (the
advice-giver) rather than that of the clinicians (the advice-taker.)
The struggles clinicians had in interpreting AI explanations
can be seen as a symptom of this lacking-a-shared-source-of-
truth problem (for example, they wanted evidence of causality
between AI’s features and patient outputs [74, 75]).
5.1 Design Goal and Strategies
We wanted to design a new form of DST that can help clinicians more effectively examine AI suggestions, such that they take only the correct suggestions and reject the AI's errors. Embracing clinicians' naturalistic interactions, we identified the following design strategies:
(1)
Drawing the evidence from information sources that clinicians
already understand and trust.
(2)
Providing evidence for/against the AI’s suggestion, rather than
explaining the AI’s inner workings.
(3)
Providing evidence of the AI suggestion’s rigor and applicability
respectively.
(4)
Providing a comprehensive set of evidence and counter-evidence,
but present each piece of evidence concisely.
5.2 The Design
Next, we translated the design strategies into a concrete interaction
design and an initial prototype. Given the focus of this research, we
focused on designing the part of the DST interface that calibrates
clinicians’ trust in AI suggestions (Figure 1, red arrow.)
(1)
Drawing the evidence from information sources that clini-
cians already understand and trust. We chose to use biomed-
ical literature as the source of AI “explanation” information
because clinicians across disease areas trust it as a source of
decision evidence.
(2) Providing evidence for/against the AI's suggestion, rather than explaining the AI's inner workings. Using the PICO framework, the new DST design retrieves biomedical articles that match the patient's situation (population) and the clinical action that the AI suggests taking (intervention, intervention comparator, and outcome of interest).
(3)
Providing evidence of the AI suggestion’s rigor and appli-
cability respectively. We included various types of literature,
from randomized clinical trial reports (rigorous but imprecise
patient population match) to historical case studies (potentially
precise patient situation matches, but less rigorous.)
(4) Provide a comprehensive set of evidence but present each piece of evidence concisely. While presenting a comprehensive list of relevant articles, we summarized each in just a single sentence. Although drawn from study 1 observations, this design choice echoes existing counterfactual-based [8, 38] and modular AI explanation designs [3]. Noteworthily, our design differs in that its evidence comes from experimentally-proven facts rather than probabilistic AI predictions.
The design of the remainder of the DST interface simply followed the lessons from prior research [73, 75]. The resulting DST interface has three parts: occupying most of the interface is the patient's demographic information, medical history, lab test results, and other EHR information. On the right side is the AI's personalized diagnostic or treatment suggestion. Because we wanted to probe clinicians' feedback toward our design strategy of "providing evidence rather than AI explanation", the prototype did not include any explanation of AI performance indicators or inner workings. This allowed us to observe whether they requested to use this information.
We populated the prototype with literature information using a combination of multiple functioning machine learning (ML) models and Wizard of Oz (WoZ, only for prediction quality control). First, we searched for relevant articles using a purpose-built search engine built upon EBM-NLP, a PICO-based biomedical literature dataset [50], and clinical bioBERT [40]. We summarized each literature article by leveraging the extractive summarization function of GPT-3 (davinci-001), a pre-trained large language model [6]. Finally, we as "wizards" manually corrected the few errors in these ML-curated results, because the focus of this study is not to evaluate these ML models' document retrieval or summarization abilities. Instead, our goal is to probe whether our design goals and design choices are promising directions as such language models become increasingly capable.
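For concreteness, the sketch below shows how such a one-sentence, extraction-style summary could be requested from GPT-3 through the legacy OpenAI completions interface available at the time of this work. It is a minimal illustration under our own assumptions (prompt wording, parameters, placeholder API key), not the authors' pipeline code.

```python
# Minimal sketch (not the authors' code): one-sentence summary of a retrieved
# article abstract via GPT-3 (davinci-001), using the legacy OpenAI 0.x library.
import openai

openai.api_key = "YOUR_API_KEY"  # placeholder

def summarize_evidence(abstract: str) -> str:
    """Ask GPT-3 for a single-sentence takeaway naming the population and main outcome."""
    prompt = (
        "Summarize the following clinical study abstract in one sentence, "
        "stating the studied population and the main outcome:\n\n"
        f"{abstract}\n\nOne-sentence summary:"
    )
    response = openai.Completion.create(
        model="text-davinci-001",  # GPT-3 model named in the paper
        prompt=prompt,
        max_tokens=60,
        temperature=0.0,           # deterministic, extraction-style output
    )
    return response["choices"][0]["text"].strip()
```

In the prototype workflow described above, such model output would still pass through the wizards' quality-control review before reaching clinicians.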
6 STUDY TWO FINDINGS: HOW CLINICIANS VALIDATED AI SUGGESTIONS WITH LITERATURE
In study two, we interviewed an additional 9 clinicians for their feedback on the new DST design and collaboratively improved the prototype. Our findings confirmed that the design strategies drawn from clinicians' naturalistic suggestion-validation interactions can help calibrate their trust in AI suggestions. Moreover, clinicians' interactions with the prototype also revealed new design and research opportunities around (1) harnessing the complementary strengths of literature-based and predictive decision supports; (2) mitigating risks of de-skilling clinicians; and (3) offering low-data decision support with literature.
6.1 Confirming the Design Strategies
Literature evidence is a valuable complement to "big data". All clinicians we interviewed welcomed the shift of focus from explaining AI's inner workings to providing clinical evidence of the AI's suggestions. While the prototype included no explanations of the AI or its performance indicators, only one clinician requested such an explanation (she requested to see which hospitals' patient data the AI suggestions are trained on). In contrast, clinicians described the causality-focused literature evidence as a valuable complement to the correlation-driven "big data predictions". They analogized AI predictions to "aggregating many, many experiences of many doctors" while describing literature evidence as embodying the mantra of Evidence-Based Medicine. The former resembles clinicians' tacit practical experience, while the latter offers proven knowledge and what "most doctors decided was a good decision." Both cover critical aspects of clinical decision-making.
"There are different kinds of information that can inform your decision. [...] There's evidence-based medicine (EBM). [...] But experience is also very important. Mind you, medicine (practice) is not really like mathematics. It is not an accurate science. It always has surprises and the common knowledge that people have from other doctors." (P21, family medicine physician)
Clinicians working in interdisciplinary medical domains (e.g.,
emergency care, family medicine, internal medicine) particularly
appreciated the literature evidence. These domains cover “a very,
very large area of knowledge”. Literature evidence can help them
examine a broader range of decision hypotheses and AI suggestions.
Clinicians highly valued the evidence of AI suggestion applicability. Echoing findings from study one, the prototyping study punctuated the critical importance of AI suggestions' applicability. Recall that our initial prototype offered a brief summary of each biomedical article's findings. The clinicians we interviewed instead demanded an explicit description of the patient situation the article focuses on, even if it means more text to read (see Figure 2 for the revised design). In order to more easily examine the literature evidence's applicability, some requested us to color-code the different characteristics of the article's population of focus and to highlight the corresponding characteristics in the current patient's EHR information (P14, P15, P17). "How the medicine fits into the patient's narratives is important." (P17)
"I need to figure out how to treat this really complex disease . . . it's the treatment for one can kill the patient with the other." (P15, Emergency Room resident)
Clinicians demanded a concise summary of a comprehen-
sive set of evidence. All clinicians we interviewed expected the
literature evidence to be comprehensive. They considered compre-
hensiveness to be a key strength of literature. For example, while
seeing “big data” as capturing patient outcomes of the lab tests
and treatments that have long been available, clinicians considered
literature as capturing the latest etiology, tests, and treatment op-
tions that can update their knowledge. The comprehensiveness of
literature evidence is particularly valuable to clinicians when they
discuss patient cases with colleagues and mentees.
“(I want to use this tool) when I’m presenting this [pa-
tient] case on the rounds to somebody, just put my
thoughts together and make myself sound more on top
of things. There is an element of showmanship in rounds,
where you are trying to show how smart you are, that
you've thought everything, that you got everything under control which you don't." (P13, Emergency Room physician)

Figure 2: The revised DST design resulted from a co-prototyping study with 9 clinicians. The clinicians found the literature evidence useful in complementing the data-driven DST predictions.
Clinicians wanted the most concise summary possible while cov-
ering the literature evidence comprehensively. “Ideally, it can just
give me some kind of alert”, an alert that the literature says what the
AI suggests and what they are about to do are wrong. The Emer-
gency Room clinicians we interviewed all requested the literature
evidence to be summarized into such an alert, because otherwise
“You know, if it’s 2 o’clock in the morning and I’m seeing a patient with
pneumonia, I'm not going to look at this." Other clinicians expressed
more appreciation of the detailed literature evidence, as they may
study it before meeting patients or after work (e.g., “when you sit
at home at night, thinking through this patient.”) In the meantime,
they also wished for “literature evidence in a condensed form” such
that they could use it when “sitting in front of patients and needing
to decide something quickly”.
Clinicians trust biomedical literature for debiasing AI suggestions. Our DST prototype intends to calibrate clinician-AI trust in diagnostic and treatment decision-making with literature evidence. Interestingly, clinicians appreciated the literature not only in these decision contexts, but also in debiasing the AIs' and their own decision-making. This sentiment confirmed that, to clinicians, literature is often a more trusted source of information than explanations of AI's inner workings. It also suggests that DSTs can leverage literature for interactions beyond validating AI's diagnostic and treatment suggestions.
“Literature can point out biases doctors made (in the AI
training data), like black patients didn’t get (cardiac)
catheterization as frequently as white patients did, and
women didn’t get evaluated for coronary diseases as
often as men did. Women present differently with coro-
nary symptoms; they don’t describe [the symptoms]
using the same words that were used in the textbooks,
because the textbooks were based upon male patients.
[The literature] is very thorough.” (P13, Emergency
Room physician)
6.2 Revealing New Design Implications
The preceding observations confirmed that the types of information clinicians used to
examine each other’s suggestions (external evidence of the suggestions’ rigor and
applicability) could also help them examine AI suggestions. Notably, biomedical literature
and our prototype design represented only one source of
such external evidence and one way of presenting it. Clinicians’
suggestions for improving the prototype revealed additional oppor-
tunities in this newly opened design space.
Harnessing additional sources of clinical evidence: medical im-
ages and -omics data. The clinicians we interviewed requested
forms of clinical evidence beyond text-based literature summaries.
For example, multiple physicians requested image-based literature
evidence to help validate computer vision suggestions and make
image-based care decisions. The “gold standard” reference images
in pathology and dermatology literature can be valuable in these
decision contexts. Similarly, pharmacists expressed a desire for lit-
erature evidence that matches patients’ “-omics data” (patients’ genetic or molecular
profiles, which indicate their likely disease symptoms and medication responses).
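As a hypothetical illustration of such an extension (the variant identifiers and the variant-to-article index below are assumptions, not a description of an existing system), literature evidence could be retrieved by matching a patient’s pharmacogenomic variants against the variants each article studies:

```python
# Hypothetical sketch: retrieve literature evidence that matches a patient's
# -omics profile (here, pharmacogenomic variants). The variant-to-article
# index is assumed; identifiers are illustrative only.

literature_index = {
    "CYP2C19*2": ["Reduced clopidogrel activation in CYP2C19*2 carriers (RCT)"],
    "SLCO1B1*5": ["Statin myopathy risk in SLCO1B1*5 carriers (cohort)"],
}

def omics_matched_evidence(patient_variants: list[str]) -> dict[str, list[str]]:
    """Return, for each patient variant, the articles whose studied population
    carries that variant, so a DST can surface them next to drug suggestions."""
    return {v: literature_index.get(v, []) for v in patient_variants}

print(omics_matched_evidence(["CYP2C19*2", "BRCA1 c.68_69delAG"]))
```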
Harnessing literature evidence as stand-alone decision support
for low-resource hospitals. Multiple clinicians suggested that the
literature evidence alone can be valuable for low-resource hospitals,
such as rural health clinics and veterans’ hospitals. While some of
these hospitals have only recently moved to EHRs and may not have deployed AI-based DSTs,
the literature evidence can nonetheless benefit their workforce, who are likely to be less
experienced and to face a more severe shortage of physicians.
6.3 De-skilling Clinicians? A Risk of Calibrating Clinician-AI Trust with Literature Evidence
Among the nine clinicians we interviewed, two (P13, P15) raised a concern about clinicians
becoming over-reliant on clinical literature. While other clinicians characterized the
literature evidence as a trigger for self-reflection and slow thinking, these clinicians
analogized literature evidence to medical textbooks or even step-
by-step instructions.
P15, describing her ideal form of literature evidence:
“It’s just kind of like someone made bullet points for
other providers saying, ‘Hey, this is what we do.’”
In this context, they pushed back on seeing literature evidence on the DST, as it might over
time undermine clinicians’ ability to “think on their feet” at the point of care. “A good
physician should be able to practice medicine in a power failure,” P13 stated.
7 DISCUSSION
Clinical decision support tools (DSTs), powered by Artificial Intelli-
gence (AI), promise to improve clinicians’ diagnosis and treatment
decision-making process. However, no AI model is always correct.
DSTs must enable clinicians to validate AI suggestions on a case-by-
case basis, convincing them to take AI’s correct suggestions while
rejecting its errors. Prior DST designs often explained AI’s inner
workings or performance indicators. This paper provided an alter-
native perspective to this approach. Drawing from how clinicians
validated each other’s suggestions in practice, we demonstrated
that DSTs might become more effective in calibrating clinicians’
trust in AI suggestions if they provided evidence of the suggestions’
robustness and applicability to the specific patient situation at hand.
Such evidence should be comprehensive but concise and come from
a shared source of truth with clinicians.
This approach offers a timely answer to the HCI community’s call for a “use-context-based”
approach to designing explainable AI [43]. Below, we first discuss how our findings point to
new design opportunities around clinician-AI trust calibration and XAI. We then take a step
back and more critically reflect on the roles that literature-based AI and
patient-history-based AI can and should each play in supporting clinician decision-making.
7.1 Designing for Clinician-AI Trust Calibration
Like many others in the HCI community [25, 35, 75], we argue that
“explaining how AI works” cannot fully calibrate clinicians’ trust in
AI’s individual suggestions, even though it is indispensable in many
other contexts (e.g., AI accountability, regulation). To help draw this
distinction, we promote the terminology shift away from “XAI” as a
catch-all term, to the concept of “trust-calibration interaction design”.
In doing so, we hope to open up a new design space that explores a
wide range of trust calibration information and interactions that go
beyond explaining AI’s inner workings or performance indicators.
The literature-based prototype presented in this paper provides one
example. Future DST design practitioners should further explore
this rich design space.
To jump-start this exploration, our empirical findings identified three characteristics of
biomedical literature that have made it effective in calibrating clinicians’ trust in AI.
First, it contains information that is inherently trustworthy to clinicians, an essential
requirement of clinical decision-support information [34]. Second, literature evidence is a
shared source of truth among clinicians across medical domains, and is therefore useful for
real-world clinical decision-making, which is often highly collaborative and
interdisciplinary. Third, unlike explanations of AI’s inner workings, the literature contains
varied information that can collectively support all types of clinical decisions (e.g.,
etiology, diagnosis, treatment, medical imaging interpretation). When caring for each
patient, clinicians make a series of interconnected decisions rather than discrete ones
(e.g., identifying the cause of a symptom overlaps with concluding a diagnosis) [76].
Drawing from the same source of evidence across these decisions reduces clinicians’
cognitive load and saves time.
Future research should explore new information sources that
share these characteristics (e.g., biomedical literature, medical im-
age references, genomics data) to address AI’s potential errors and
biases. Recent AI research that cross-checks EHR-based AI patient
outcome predictions against the biological relations between treat-
ment and outcome [2] and against clinician-authored knowledge graphs [4] can be seen as
great examples in this direction.
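As a simplified illustration of this cross-checking direction, and not the method of [2] or [4], a DST could verify whether a predicted treatment-to-outcome association is supported by any path in a clinician-authored knowledge graph before surfacing it; the graph content below is hypothetical.

```python
# Simplified illustration (not the method of [2] or [4]): check whether a
# predicted treatment -> outcome association has any supporting path in a
# clinician-authored knowledge graph. Graph content is hypothetical.

from collections import deque

knowledge_graph = {
    "beta blocker": ["reduced heart rate"],
    "reduced heart rate": ["lower myocardial oxygen demand"],
    "lower myocardial oxygen demand": ["fewer angina episodes"],
}

def has_supporting_path(graph: dict, treatment: str, outcome: str) -> bool:
    """Breadth-first search for a biological path linking treatment to outcome."""
    queue, seen = deque([treatment]), {treatment}
    while queue:
        node = queue.popleft()
        if node == outcome:
            return True
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False

print(has_supporting_path(knowledge_graph, "beta blocker", "fewer angina episodes"))  # True
```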
At a higher level, we advocate for an “infrastructural approach”
to calibrate clinicians’ trust in clinical AI systems. Even though
each EHR-based AI most often supports one clinician in making
one decision at a time, AI trust-calibration design should consider the broader context in
which clinicians constantly collaborate and make interconnected decisions. In this context,
it is more effective for DSTs to draw from a consistent source of information that can
validate diverse AI suggestions (regardless of whether the AI is a computer vision model or
a simple Bayesian model). DSTs can then adapt the information
presentation and interaction design to each decision, its particular
human context, and the AI system involved. Biomedical literature
and genomics data have started becoming part of the healthcare
information infrastructure; Can DSTs harness this infrastructure
and support clinicians’ trust calibration in AI? What other informa-
tion sources can become such a rising tide that raises all (AI) boats?
These are exciting questions for future work to explore.
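To make the infrastructural framing concrete, one could imagine a single shared evidence source serving many DSTs, with thin presentation adapters per decision context. The sketch below is illustrative only; the class and method names are assumptions rather than a proposed API.

```python
# Sketch of the "infrastructural" idea: one shared evidence source serves
# many DSTs, while thin adapters tailor presentation to each decision
# context. Class and method names are illustrative, not a proposed API.

from typing import Protocol

class EvidenceSource(Protocol):
    def retrieve(self, query: str) -> list[str]: ...

class LiteratureSource:
    def retrieve(self, query: str) -> list[str]:
        # In practice this would call a literature search / summarization service.
        return [f"Summary of an article relevant to: {query}"]

class AlertPresenter:
    """Emergency-care style: one line, shown only when evidence matters."""
    def present(self, evidence: list[str]) -> str:
        return evidence[0] if evidence else "No conflicting evidence found."

class DetailedPresenter:
    """Study-at-home style: full summaries for later reading."""
    def present(self, evidence: list[str]) -> str:
        return "\n".join(evidence)

source: EvidenceSource = LiteratureSource()
evidence = source.retrieve("anticoagulation in atrial fibrillation with CKD")
print(AlertPresenter().present(evidence))
print(DetailedPresenter().present(evidence))
```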
7.2 Blending Diverse Sources of Intelligence To
Support Clinical Decision-Making
We have discussed the design and research opportunities revealed by harnessing literature as
a source of trust-calibration information. Parallel to these efforts should be work that
critically reflects on the limitations we have observed in literature (or in information that
shares some of its characteristics). We discuss these limitations along the two axes of
difference that have scaffolded the debates around “what constitutes the best evidence for
clinical decision-making”. These axes offer a useful structure for future research to
investigate: how can DSTs best harness the complementary strengths of literature-based and
EHR-based decision supports while ameliorating their weaknesses?
Evidence rigor and evidence applicability. The clinical world
famously values causality: every clinical action ideally has a causal
relationship with improved patient outcome, and the causality ide-
ally has been proven by randomized clinical trials on a large, diverse
patient population [
7
,
30
]. Clinicians’ focus on causality and large-
scale evaluation is often described as causing their resistance to
AI’s highly-personalized decision suggestions [25, 34, 74, 75].
Our Study 1 findings suggest that there is more to the story:
Clinicians value the external validity of decision evidence (Does
the information apply to this particular patient?) as much as its
internal validity (How rigorously was it generated? Was it validated
in a large patient population?). The perfect evidence is one that has
been proven in a large patient population that resembles their
patient’s situation 100%. Such evidence simply does not exist. It
is in this context that we do not see the success of our design as
an indication that literature evidence should replace explanations
of AI’s inner workings. Instead, clinicians appreciated our design
because it paired AI’s personalized suggestion (high applicability)
with causality-based literature evidence (high internal rigor). We
see an opportunity for future research to reflect on and further
improve the way our prototype coordinated AI explanations and
literature evidence, so that they can together form the “perfect”
evidence for clinicians’ decisions.
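One way to make this pairing explicit, purely as an illustration under assumed scoring rules (an evidence-level hierarchy for rigor and a population-overlap measure for applicability), is to score each piece of evidence on both axes and present the two scores side by side rather than collapsing them into one:

```python
# Illustrative scoring of evidence along the two axes discussed above.
# The evidence-level ranks and the similarity measure are assumptions, meant
# to show the two scores side by side rather than collapsed into one number.

RIGOR_RANK = {"systematic review": 1.0, "RCT": 0.8, "cohort": 0.6, "case series": 0.3}

def applicability(article_population: set[str], patient_traits: set[str]) -> float:
    """Jaccard overlap between the article's population descriptors and the patient's traits."""
    if not article_population or not patient_traits:
        return 0.0
    return len(article_population & patient_traits) / len(article_population | patient_traits)

def score(article: dict, patient_traits: set[str]) -> tuple[float, float]:
    return (RIGOR_RANK.get(article["study_type"], 0.0),
            applicability(set(article["population"]), patient_traits))

patient = {"age>65", "CKD", "female"}
article = {"study_type": "RCT", "population": ["age>65", "CKD"]}
print(score(article, patient))  # (rigor, applicability), reported separately
```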
Explicit knowledge and tacit knowledge. Evidence, whether it
is from past patients’ data or clinical literature, is not omnipotent.
The situated and interpretative nature of many clinical decisions
requires knowledge beyond what’s proven effective in past patient
cases or clinical trials [61]. But in what circumstances should one
trust clinicians’ “experience”, or tacit knowledge, and when should
one instead force them to think slow and consider both the support-
ing and counter-evidence of their judgments? This is a question
of long-standing debates in medicine and in AI decision support
design [75].
This question also underlies the limitations of literature-based
evidence. On the one hand, how can literature- and EHR-based AI
systems enhance clinicians’ trained intuitions, rather than distract from or even de-skill
them? The emergency room physicians we interviewed wished our literature evidence system
took the form of an alert button (“Literature evidence in aggregate disagrees with your
judgment”), pointing to an interesting direction for exploration. On
the other hand, how can literature-based AI better serve decisions
where clinicians’ intuition falls short? For example, one clinician
wished for a literature-based AI that reminds them of the gender
and racial biases in old clinical trial results as well as in their own
decision-making. By sharing these clinician critiques, we hope
to start a reflective discussion about how HCI research can better harness the specific
types of information in the literature to enhance clinician judgments in different ways.
8 CONCLUSION AND LIMITATIONS
This paper illustrated how clinicians used literature to validate each other’s suggestions
and presented a new DST that embraces such naturalistic interactions. The new design uses
GPT-3 to draw literature evidence that shows the AI suggestions’ robustness and
applicability (or the lack thereof). In doing so, we promote a new approach to “explainable
AI” that focuses not on explaining the AI per se, but on designing information for intuitive
AI trust calibration.
In the midst of explosive growth in Foundational Language Models,
this work revealed new design and research opportunities around
(1) harnessing the complementary strengths of literature-based
and predictive decision supports; (2) mitigating risks of de-skilling
clinicians; and (3) offering low-data decision support with literature.
Parallel to work developing this new design approach, there
should also be work critically examining it. Here, we also highlight
two limitations of this work that merit further research. First, while
the GPT-3-based prototype in this work was sufficient for most clinicians we interviewed,
more work is needed to understand the errors such models can make. Today’s language
technologies are by no means perfect at processing biomedical texts. Such errors can be
particularly critical in Emergency Care, where clinicians need concise summaries of
literature evidence without losing necessary caveats and nuances. Second, this work studied
clinicians from 14 different medical specialties. Future work should evaluate this
literature-based design in more disease areas to further understand the scope of its
generalizability.
ACKNOWLEDGMENTS
The first author’s effort is partially supported by the AI2050 Early
Career Fellowship. This work was supported by Cornell and Weill
Cornell Medicine’s Multi-Investigator Seed Grants (MISGs) “Lever-
aging Biomedical Literature in Supporting Clinical Reasoning and
Decision Making”.
REFERENCES
[1] [n.d.]. Journal Club for iPhone/Android. https://wikijournalclub.org/app/
[2]
Alexis Allot, Yifan Peng, Chih-Hsuan Wei, Kyubum Lee, Lon Phan, and Zhiyong
Lu. 2018. LitVar: a semantic search engine for linking genomic variant data in
PubMed and PMC. Nucleic acids research 46, W1 (2018), W530–W536.
[3]
David Alvarez-Melis, Harmanpreet Kaur, Hal Daumé III, Hanna Wallach, and
Jennifer Wortman Vaughan. 2021. From human explanation to model inter-
pretability: A framework based on weight of evidence. In AAAI Conference on
Human Computation and Crowdsourcing (HCOMP).
[4]
Daniel M Bean, Honghan Wu, Ehtesham Iqbal, Olubanke Dzahini, Zina M Ibrahim,
Matthew Broadbent, Robert Stewart, and Richard JB Dobson. 2017. Knowledge
graph prediction of unknown adverse drug reactions and validation in electronic
health records. Scientific reports 7, 1 (2017), 1–11.
Harnessing Biomedical Literature to Calibrate Clinicians’ Trust in AI Decision Support Systems CHI ’23, April 23–28, 2023, Hamburg, Germany
[5]
Mary Jo Bitner, Amy L Ostrom, and Felicia N Morgan. 2007. Service Blueprinting:
A Practical Technique for Service Innovation. (2007).
[6]
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan,
Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda
Askell, et al. 2020. Language models are few-shot learners. Advances in neural
information processing systems 33 (2020), 1877–1901.
[7]
Patricia B Burns, Rod J Rohrich, and Kevin C Chung. 2011. The Levels of Evidence
and their role in Evidence-Based Medicine. Plastic and reconstructive surgery 128,
1 (2011), 305.
[8]
Ruth MJ Byrne. 2019. Counterfactuals in Explainable Artificial Intelligence (XAI): Evidence from Human Reasoning. In IJCAI. 6276–6282.
[9]
Carrie J Cai, Samantha Winter, David Steiner, Lauren Wilcox, and Michael Terry.
2019. "Hello AI": Uncovering the Onboarding Needs of Medical Practitioners for
Human-AI Collaborative Decision-Making. Proceedings of the ACM on Human-
computer Interaction 3, CSCW (2019), 1–24.
[10]
Feixiong Cheng and Zhongming Zhao. 2014. Machine learning-based prediction
of drug–drug interactions by integrating drug phenotypic, therapeutic, chemical,
and genomic properties. Journal of the American Medical Informatics Association
21, e2 (2014), e278–e286.
[11]
Deborah J Cook, Cynthia D Mulrow, and R Brian Haynes. 1997. Systematic
reviews: synthesis of best evidence for clinical decisions. Annals of internal
medicine 126, 5 (1997), 376–380.
[12]
Juliet Corbin and Anselm Strauss. 2014. Basics of qualitative research: Techniques
and procedures for developing grounded theory. Sage publications.
[13]
Azra Daei, Mohammad Reza Soleymani, Hasan Ashrafi-rizi, Ali Zargham-
Boroujeni, and Roya Kelishadi. 2020. Clinical information seeking behavior
of physicians: A systematic review. International Journal of Medical Informatics
139 (2020), 104144. https://doi.org/10.1016/j.ijmedinf.2020.104144
[14]
Guilherme Del Fiol, T Elizabeth Workman, and Paul N Gorman. 2014. Clinical
questions raised by clinicians at the point of care: a systematic review. JAMA
internal medicine 174, 5 (2014), 710–718.
[15]
Mariah Dreisinger, Terry L Leet, Elizabeth A Baker, Kathleen N Gillespie, Beth
Haas, and Ross C Brownson. 2008. Improving the public health workforce:
evaluation of a training course to enhance evidence-based decision making.
Journal of Public Health Management and Practice 14, 2 (2008), 138–143.
[16] Lilian Edwards and Michael Veale. 2017. Slave to the algorithm: Why a right to
an explanation is probably not the remedy you are looking for. Duke L. & Tech.
Rev. 16 (2017), 18.
[17]
Upol Ehsan, Pradyumna Tambwekar, Larry Chan, Brent Harrison, and Mark O.
Riedl. 2019. Automated Rationale Generation: A Technique for Explainable AI
and Its Effects on Human Perceptions. In Proceedings of the 24th International
Conference on Intelligent User Interfaces (Marina del Ray, California) (IUI ’19).
Association for Computing Machinery, New York, NY, USA, 263–274. https://doi.org/10.1145/3301275.3302316
[18]
Shaker El-Sappagh, Farman Ali, Abdeltawab Hendawi, Jun-Hyeog Jang, and
Kyung-Sup Kwak. 2019. A mobile health monitoring-and-treatment system based
on integration of the SSN sensor ontology and the HL7 FHIR standard. BMC
medical informatics and decision making 19, 1 (2019), 97.
[19]
John W Ely, Jerome A Osheroff, M Lee Chambliss, Mark H Ebell, and Marcy E
Rosenbaum. 2005. Answering physicians’ clinical questions: obstacles and po-
tential solutions. Journal of the American Medical Informatics Association 12, 2
(2005), 217–224.
[20]
Matthew E Falagas, Eleni I Pitsouni, George A Malietzis, and Georgios Pappas.
2008. Comparison of PubMed, Scopus, web of science, and Google scholar:
strengths and weaknesses. The FASEB journal 22, 2 (2008), 338–342.
[21]
A. R. Firestone, D. Sema, T. J. Heaven, and R. A. Weems. 1998. The effect of
a knowledge-based, image analysis and clinical decision support system on
observer performance in the diagnosis of approximal caries from radiographic
images. Caries research 32, 2 (Mar 1998), 127–34. https://search.proquest.com/
docview/220212660?accountid=9902
[22]
Gary N Fox and Nashat S Moawad. 2003. UpToDate: a comprehensive clinical
database. Journal of family practice 52, 9 (2003), 706–710.
[23]
Nasra Gathoni. 2021. Evidence Based Medicine: The Role of the Health Sciences
Librarian. Library Philosophy and Practice (e-journal) 6627 (2021), 1.
[24]
Timnit Gebru, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan,
Hanna Wallach, Hal Daumé III, and Kate Crawford. 2021. Datasheets for datasets.
Commun. ACM 64, 12 (2021), 86–92.
[25]
Marzyeh Ghassemi, Luke Oakden-Rayner, and Andrew L Beam. 2021. The false
hope of current approaches to explainable artificial intelligence in health care.
The Lancet Digital Health 3, 11 (2021), e745–e750.
[26]
Randy Goebel, Ajay Chander, Katharina Holzinger, Freddy Lecue, Zeynep Akata,
Simone Stumpf, Peter Kieseberg, and Andreas Holzinger. 2018. Explainable AI:
The New 42?. In Machine Learning and Knowledge Extraction, Andreas Holzinger,
Peter Kieseberg, A Min Tjoa, and Edgar Weippl (Eds.). Springer International
Publishing, Cham, 295–303.
[27]
Paul N Gorman and Mark Helfand. 1995. Information seeking in primary care:
how physicians choose which clinical questions to pursue and which to leave
unanswered. Medical Decision Making 15, 2 (1995), 113–119.
[28]
Trisha Greenhalgh. 2002. Intuition and evidence–uneasy bedfellows? British
Journal of General Practice 52, 478 (2002), 395–400.
[29]
Gordon Guyatt, Drummond Rennie, Maureen Meade, Deborah Cook, et al. 2002.
Users’ guides to the medical literature: a manual for evidence-based clinical practice.
Vol. 706. AMA press Chicago.
[30]
Judith Haber. 2018. PART II Processes of Developing EBP and Questions in
Various Clinical Settings. Evidence-Based Practice for Nursing and Healthcare
Quality Improvement-E-Book (2018), 31.
[31]
C Harris and T Turner. 2011. Evidence-Based Answers to Clinical Questions for
Busy Clinicians. In Centre for Clinical Effectiveness. Monash Health, 1–32.
[32]
Andreas Holzinger, Bernd Malle, Peter Kieseberg, Peter M. Roth, Heimo Müller,
Robert Reihs, and Kurt Zatloukal. 2017. Towards the Augmented Pathologist:
Challenges of Explainable-AI in Digital Pathology. CoRR abs/1712.06657 (2017).
arXiv:1712.06657 http://arxiv.org/abs/1712.06657
[33]
Julie A Jacobs, Elizabeth A Dodson, Elizabeth A Baker, Anjali D Deshpande, and
Ross C Brownson. 2010. Barriers to evidence-based decision making in public
health: a national survey of chronic disease practitioners. Public Health Reports
125, 5 (2010), 736–742.
[34]
Maia Jacobs, Jeffrey He, Melanie F. Pradier, Barbara Lam, Andrew C Ahn,
Thomas H McCoy, Roy H Perlis, Finale Doshi-Velez, and Krzysztof Z Gajos.
2021. Designing AI for trust and collaboration in time-constrained medical deci-
sions: a sociotechnical lens. In Proceedings of the 2021 CHI Conference on Human
Factors in Computing Systems. 1–14.
[35]
Harmanpreet Kaur, Eytan Adar, Eric Gilbert, and Cliff Lampe. 2022. Sensible
AI: Re-Imagining Interpretability and Explainability Using Sensemaking Theory.
In 2022 ACM Conference on Fairness, Accountability, and Transparency (Seoul,
Republic of Korea) (FAccT ’22). Association for Computing Machinery, New York,
NY, USA, 702–714. https://doi.org/10.1145/3531146.3533135
[36]
Ajay Kohli and Saurabh Jha. 2018. Why CAD failed in mammography. Journal
of the American College of Radiology 15, 3 (2018), 535–537.
[37]
Michael Kronenfeld, Priscilla L Stephenson, Barbara Nail-Chiwetalu, Elizabeth M
Tweed, Eric L Sauers, Tamara C Valovich McLeod, Ruiling Guo, Henry Trahan,
Kristine M Alpi, Beth Hill, et al. 2007. Review for librarians of evidence-based
practice in nursing and the allied health professions in the United States. Journal
of the Medical Library Association: JMLA 95, 4 (2007), 394.
[38]
Thao Le, Tim Miller, Ronal Singh, and Liz Sonenberg. 2022. Improving Model
Understanding and Trust with Counterfactual Explanations of Model Confidence.
arXiv preprint arXiv:2206.02790 (2022).
[39]
Howard Lee and Yi-Ping Phoebe Chen. 2015. Image based computer aided
diagnosis system for cancer detection. Expert Systems with Applications 42, 12
(2015), 5356–5365.
[40]
Jinhyuk Lee, Wonjin Yoon, Sungdong Kim, Donghyeon Kim, Sunkyu Kim,
Chan Ho So, and Jaewoo Kang. 2020. BioBERT: a pre-trained biomedical language
representation model for biomedical text mining. Bioinformatics 36, 4 (2020),
1234–1240.
[41]
Constance D Lehman, Robert D Wellman, Diana SM Buist, Karla Kerlikowske,
Anna NA Tosteson, and Diana L Miglioretti. 2015. Diagnostic accuracy of digital
screening mammography with and without computer-aided detection. JAMA
internal medicine 175, 11 (2015), 1828–1837.
[42]
Q. Vera Liao and S. Shyam Sundar. 2022. Designing for Responsible Trust
in AI Systems: A Communication Perspective. In 2022 ACM Conference on
Fairness, Accountability, and Transparency (Seoul, Republic of Korea) (FAccT
’22). Association for Computing Machinery, New York, NY, USA, 1257–1268.
https://doi.org/10.1145/3531146.3533182
[43]
Q. Vera Liao, Yunfeng Zhang, Ronny Luss, Finale Doshi-Velez, and Amit Dhurand-
har. 2022. Connecting Algorithmic Research and Usage Contexts: A Perspective
of Contextualized Evaluation for Explainable AI. https://doi.org/10.48550/ARXIV.2206.10847
[44]
Youn-Kyung Lim, Erik Stolterman, and Josh Tenenberg. 2008. The anatomy of
prototypes: Prototypes as filters, prototypes as manifestations of design ideas.
ACM Transactions on Computer-Human Interaction (TOCHI) 15, 2 (2008), 1–27.
[45] Tuuli Mattelmäki et al. 2006. Design probes. Aalto University.
[46]
Margaret Mitchell, Simone Wu, Andrew Zaldivar, Parker Barnes, Lucy Vasserman,
Ben Hutchinson, Elena Spitzer, Inioluwa Deborah Raji, and Timnit Gebru. 2019.
Model cards for model reporting. In Proceedings of the conference on fairness,
accountability, and transparency. 220–229.
[47]
Bill Moggridge and Bill Atkinson. 2007. Designing interactions. Vol. 17. MIT press
Cambridge, MA.
[48]
M Hassan Murad, Noor Asi, Mouaz Alsawas, and Fares Alahdab. 2016. New
evidence pyramid. BMJ Evidence-Based Medicine 21, 4 (2016), 125–127.
[49]
Eva Nourbakhsh, Rebecca Nugent, Helen Wang, Cihan Cevik, and Kenneth
Nugent. 2012. Medical literature searches: a comparison of PubMed and Google Scholar. Health Information & Libraries Journal 29, 3 (2012), 214–222.
[50]
Benjamin E. Nye, Ani Nenkova, Iain J. Marshall, and Byron C. Wallace. [n.d.].
Trialstreamer: Mapping and Browsing Medical Evidence in Real-Time. ([n. d.]).
http://arxiv.org/abs/2005.10865v1
[51]
Cecilia Panigutti, Andrea Beretta, Fosca Giannotti, and Dino Pedreschi. 2022.
Understanding the Impact of Explanations on Advice-Taking: A User Study
for AI-Based Clinical Decision Support Systems. In Proceedings of the 2022 CHI
Conference on Human Factors in Computing Systems (New Orleans, LA, USA)
(CHI ’22). Association for Computing Machinery, New York, NY, USA, Article
568, 9 pages. https://doi.org/10.1145/3491102.3502104
[52]
Kate Radcliffe, Helena C Lyson, Jill Barr-Walker, and Urmimala Sarkar. 2019.
Collective intelligence in medical decision-making: a systematic scoping review.
BMC medical informatics and decision making 19, 1 (2019), 1–11.
[53]
Pranav Rajpurkar, Jeremy A. Irvin, Kaylie Zhu, Brandon Yang, Hershel Mehta,
Tony Duan, Daisy Yi Ding, Aarti Bagul, C. Langlotz, Katie S. Shpanskaya,
Matthew P. Lungren, and A. Ng. 2017. CheXNet: Radiologist-Level Pneumonia
Detection on Chest X-Rays with Deep Learning. ArXiv abs/1711.05225 (2017).
[54]
Amy Rechkemmer and Ming Yin. 2022. When Confidence Meets Accuracy: Exploring the Effects of Multiple Performance Indicators on Trust in Machine
Learning Models. In Proceedings of the 2022 CHI Conference on Human Factors in
Computing Systems (New Orleans, LA, USA) (CHI ’22). Association for Computing
Machinery, New York, NY, USA, Article 535, 14 pages. https://doi.org/10.1145/
3491102.3501967
[55]
Anthony L Rosner. 2012. Evidence-based medicine: revisiting the pyramid of
priorities. Journal of Bodywork and Movement Therapies 16, 1 (2012), 42–49.
[56]
Erika Rovini, Carlo Maremmani, and Filippo Cavallo. 2017. How wearable sensors
can support Parkinson’s disease diagnosis and treatment: a systematic review.
Frontiers in neuroscience 11 (2017), 555.
[57]
David L Sackett, William MC Rosenberg, JA Muir Gray, R Brian Haynes, and
W Scott Richardson. 1996. Evidence based medicine: what it is and what it isn’t.
71–72 pages.
[58]
Fernando Suarez Saiz, Corey Sanders, Rick Stevens, Robert Nielsen, Michael Britt,
Leemor Yuravlivker, Anita M Preininger, and Gretchen P Jackson. 2021. Artificial Intelligence Clinical Evidence Engine for Automatic Identification, Prioritization,
and Extraction of Relevant Clinical Oncology Research. JCO Clinical Cancer
Informatics 5 (2021), 102–111.
[59]
Mike Schaekermann, Carrie J. Cai, Abigail E. Huang, and Rory Sayres. 2020.
Expert Discussions Improve Comprehension of Difficult Cases in Medical Image
Assessment. Association for Computing Machinery, New York, NY, USA, 1–13.
https://doi.org/10.1145/3313831.3376290
[60]
Andrew D Selbst and Solon Barocas. 2018. The intuitive appeal of explainable
machines. Fordham L. Rev. 87 (2018), 1085.
[61]
Mark Sendak, Madeleine Clare Elish, Michael Gao, Joseph Futoma, William
Ratliff, Marshall Nichols, Armando Bedoya, Suresh Balu, and Cara O’Brien. 2020. “The human body is a black box”: supporting clinical decision-making with deep
learning. In Proceedings of the 2020 conference on fairness, accountability, and
transparency. 99–109.
[62]
Salimah Z Shariff, Shayna AD Bejaimal, Jessica M Sontrop, Arthur V Iansavichus,
R Brian Haynes, Matthew A Weir, and Amit X Garg. 2013. Retrieving clinical
evidence: a comparison of PubMed and Google Scholar for quick clinical searches.
Journal of medical Internet research 15, 8 (2013), e2624.
[63]
Michael Simmons, Ayush Singhal, and Zhiyong Lu. 2016. Text mining for preci-
sion medicine: bringing structure to EHRs and biomedical literature to understand
genes and health. Translational Biomedical Informatics (2016), 139–166.
[64]
Marion K Slack and Jolaine R Draugalis Jr. 2001. Establishing the internal and
external validity of experimental studies. American journal of health-system
pharmacy 58, 22 (2001), 2173–2181.
[65]
Richard Smith. 1996. What clinical information do doctors need? Bmj 313, 7064
(1996), 1062–1068.
[66]
Emily Sullivan. 2020. Understanding from machine learning models. The British
Journal for the Philosophy of Science (2020).
[67]
Reed T Sutton, David Pincock, Daniel C Baumgart, Daniel C Sadowski, Richard N
Fedorak, and Karen I Kroeker. 2020. An overview of clinical decision support
systems: benefits, risks, and strategies for success. NPJ digital medicine 3, 1 (2020),
1–10.
[68]
Audrey Tan, Mark Durbin, Frank R Chung, Ada L Rubin, Allison M Cuthel,
Jordan A McQuilkin, Aram S Modrek, Catherine Jamin, Nicholas Gavin, Devin
Mann, et al. 2020. Design and implementation of a clinical decision support tool
for primary palliative Care for Emergency Medicine (PRIM-ER). BMC medical
informatics and decision making 20, 1 (2020), 1–11.
[69]
Myriam Tanguay-Sela, David Benrimoh, Christina Popescu, Tamara Perez,
Colleen Rollins, Emily Snook, Eryn Lundrigan, Caitrin Armstrong, Kelly Perlman,
Robert Fratila, Joseph Mehltretter, Sonia Israel, Monique Champagne, Jérôme
Williams, Jade Simard, Sagar V. Parikh, Jordan F. Karp, Katherine Heller, Outi
Linnaranta, Liliana Gomez Cardona, Gustavo Turecki, and Howard C. Margolese.
2022. Evaluating the perceived utility of an artificial intelligence-powered clini-
cal decision support system for depression treatment using a simulation center.
Psychiatry Research 308 (2022), 114336. https://doi.org/10.1016/j.psychres.2021.
114336
[70]
Anja Thieme, Maryann Hanratty, Maria Lyons, Jorge E Palacios, Rita Marques,
Cecily Morrison, and Gavin Doherty. 2022. Designing Human-Centered AI for
Mental Health: Developing Clinically Relevant Applications for Online CBT
Treatment. ACM Transactions on Computer-Human Interaction (2022).
[71]
Simon M. Thomas, James G. Lefevre, Glenn Baxter, and Nicholas A. Hamilton.
2021. Interpretable deep learning systems for multi-class segmentation and
classification of non-melanoma skin cancer. Medical Image Analysis 68 (2021),
101915. https://doi.org/10.1016/j.media.2020.101915
[72]
Sahil Verma, John Dickerson, and Keegan Hines. 2020. Counterfactual explana-
tions for machine learning: A review. arXiv preprint arXiv:2010.10596 (2020).
[73]
Danding Wang, Qian Yang, Ashraf Abdul, and Brian Y. Lim. 2019. Designing
Theory-Driven User-Centric Explainable AI. In Proceedings of the 2019 CHI Con-
ference on Human Factors in Computing Systems (Glasgow, Scotland Uk) (CHI
’19). ACM, New York, NY, USA, Article 601, 15 pages. https://doi.org/10.1145/
3290605.3300831
[74]
Yao Xie, Melody Chen, David Kao, Ge Gao, and Xiang’Anthony’ Chen. 2020.
CheXplain: Enabling Physicians to Explore and Understand Data-Driven, AI-
Enabled Medical Imaging Analysis. In Proceedings of the 2020 CHI Conference on
Human Factors in Computing Systems. 1–13.
[75]
Qian Yang, Aaron Steinfeld, and John Zimmerman. 2019. Unremarkable AI: Fitting
Intelligent Decision Support into Critical, Clinical Decision-Making Processes. In
Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems
(Glasgow, Scotland Uk) (CHI ’19). Association for Computing Machinery, New
York, NY, USA, 1–11. https://doi.org/10.1145/3290605.3300468
[76]
Qian Yang, John Zimmerman, Aaron Steinfeld, Lisa Carey, and James F. Antaki.
2016. Investigating the Heart Pump Implant Decision Process: Opportunities
for Decision Support Tools to Help. In Proceedings of the 2016 CHI Conference on
Human Factors in Computing Systems (San Jose, California, USA) (CHI ’16). ACM,
New York, NY, USA, 4477–4488. https://doi.org/10.1145/2858036.2858373