
Harnessing Biomedical Literature to Calibrate Clinicians’ Trust in AI Decision Support Systems



Clinical decision support tools (DSTs), powered by Artificial Intelligence (AI), promise to improve clinicians' diagnostic and treatment decision-making. However, no AI model is always correct. DSTs must enable clinicians to validate each AI suggestion, convincing them to take the correct suggestions while rejecting its errors. While prior work often tried to do so by explaining AI's inner workings or performance, we chose a different approach: We investigated how clinicians validated each other's suggestions in practice (often by referencing scientific literature) and designed a new DST that embraces these naturalistic interactions. This design uses GPT-3 to draw literature evidence that shows the AI suggestions' robustness and applicability (or the lack thereof). A prototyping study with clinicians from three disease areas proved this approach promising. Clinicians' interactions with the prototype also revealed new design and research opportunities around (1) harnessing the complementary strengths of literature-based and predictive decision supports; (2) mitigating risks of de-skilling clinicians; and (3) offering low-data decision support with literature.
Qian Yang
Cornell University
Ithaca, NY, USA
Yuexing Hao
Cornell University
Ithaca, NY, USA
Kexin Quan
University of California,
San Diego
San Diego, CA, USA
Stephen Yang
Cornell University
New York City, NY, USA
Yiran Zhao
Cornell Tech
New York City, NY, USA
Volodymyr Kuleshov
Cornell Tech
New York City, NY, USA
Fei Wang
Weill Cornell Medicine
New York City, NY, USA
CCS Concepts: Information systems → Information extraction; Applied computing → Health care information systems; Human-centered computing → Empirical studies in HCI.
Keywords: Clinical AI, XAI, Biomedical Literature, Qualitative Method
ACM Reference Format:
Qian Yang, Yuexing Hao, Kexin Quan, Stephen Yang, Yiran Zhao, Volodymyr Kuleshov, and Fei Wang. 2023. Harnessing Biomedical Literature to Calibrate Clinicians’ Trust in AI Decision Support Systems. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems (CHI ’23), April 23–28, 2023, Hamburg, Germany. ACM, New York, NY, USA, 14 pages.
Equal contribution.
CHI ’23, April 23–28, 2023, Hamburg, Germany
©2023 Copyright held by the owner/author(s). Publication rights licensed to ACM.
ACM ISBN 978-1-4503-9421-5/23/04
1 Introduction
The idea of leveraging machine intelligence to improve clinical decision-making has fascinated healthcare and Artificial Intelligence (AI) researchers for decades. Today, diverse AI systems have proven their performance in research labs and are moving into clinical practice in the form of Decision Support Tools (DSTs). From Bayesian models that predict treatment outcomes based on Electronic Health Records (EHR) to computer vision systems that interpret medical images, from rule-based systems that alert clinicians to drug interactions to wearable-sensing AIs that monitor disease progression, AI-powered DSTs promise to reduce clinician decision errors and improve patient outcomes.
While all clinical AI models can offer valuable diagnostic or treatment suggestions, none is always correct. Therefore, clinicians must calibrate their trust in each AI suggestion on a case-by-case basis. DST interaction designs can help. Ideally, clinical DSTs can provide information that enables clinicians to adopt only the correct AI suggestions while staying unbiased by AI errors.
This is a challenging goal. Among other approaches, existing DST designs most often supported clinician-AI trust calibration by explaining how the AI generated its suggestions and how well it performed on past patient data. This explanation approach struggled in clinical practice: When the clinician’s hypothesis was wrong and the AI advice was correct, the explanations rarely persuaded clinicians to take the advice. When the AI suggestion was wrong, explanations could also fail to help clinicians notice the error. These failures propelled some researchers to design more persuasive AI explanations, for example, by presenting only the parts of the explanations that justify the AI’s output. However, these designs are emergent, and they have not yet empirically demonstrated their effectiveness.
This paper aims to identify new DST designs that can effectively calibrate clinicians’ trust in AI suggestions on a case-by-case basis, enabling them to take only the correct suggestions while rejecting the erroneous ones. Instead of exploring new ways to explain the AI and testing them on clinicians afterward, we chose to start by investigating how clinicians validated each other’s diagnostic or treatment suggestions in practice. We hoped that these naturalistic interactions would reveal new insights into the kinds of information clinicians need to calibrate their trust in AI suggestions. Given that biomedical literature (e.g., reports of randomized controlled trials) is known to play a central role in this process and that clinicians had rejected AI suggestions because they had not been published in prestigious clinical journals, we paid particular attention to how clinicians sought and used evidence from the literature.
Through contextual inquiries with 12 clinicians and their assistants, we found that clinicians rarely explained how they came up with a suggestion when exchanging diagnostic or treatment suggestions. Instead, they sought all evidence that might validate or invalidate the suggestion from biomedical literature, their shared source of truth. They then examined each piece of evidence for its applicability to the specific patient situation at hand, thereby concluding whether to accept the suggestion.
Embracing these findings, we designed a novel form of DST that imitates clinicians’ natural trust-calibration interactions. Rather than explaining how the AI generated its suggestion, the DST provides literature evidence that can potentially validate or invalidate the suggestion. In presenting the evidence, the DST highlights its applicability to the particular patient situation in question, rather than how rigorously it was concluded from past patient cases. A co-prototyping study with 9 clinicians confirmed these design strategies. Further, clinicians’ interactions with the prototype surfaced additional design and research opportunities around designing more intuitive clinician-AI trust-calibration interactions and harnessing biomedical literature for creating such interactions.
This paper presents the initial contextual inquiries, the new DST design, and findings from the prototyping study. It makes three contributions. First, this work provides a rare description of how clinicians sought information to validate each other’s diagnostic/treatment suggestions in their practice. It provides a timely answer to researchers’ call for a use-context-focused approach to designing explainable AI. Second, this work identifies an alternative approach to calibrating clinicians’ trust in AI. Our prototype exemplifies one possibility in this new design space. Third, the prototyping study offers an initial description of how clinicians deliberate clinical decisions using a predictive DST and biomedical literature simultaneously. Both clinical DST and Biomedical Natural Language Processing (biomedNLP) technologies are rapidly maturing. This research offers a valuable reference for future research that harnesses both technologies to improve clinical decision-making.
2.1 Challenges of Validating AI Suggestions with Explainable AI
Across the fields of HCI and AI, extensive research has studied how to calibrate clinicians’ trust in AI. Much of this research can trace its origin to explainable AI literature that aims to make AI less like a black box. As a result, when designing AI-powered DSTs, designers and researchers most often offered clinicians explanations of the AI’s inner workings. The explanations include, for example, descriptions of the training data, the machine learning model’s training and prediction processes, and the model’s performance indicators. Practitioner-facing tools (e.g., Microsoft’s HAX responsible AI toolkit, data sheets, model cards) further boosted this approach’s real-world impact. Experiments showed that AI explanations improved clinicians’ satisfaction with DSTs and increased the likelihood of them taking the AI’s advice [9, 42, 51, 60].
However, because AI suggestions are not always correct, increasing the likelihood of clinicians taking the AI’s advice did not mean that clinicians made better decisions. To make better decisions, clinicians must calibrate their trust in each AI suggestion individually, on a case-by-case basis. Existing AI explanations largely failed in this regard. In clinical practice, the AI explanations were either unable to persuade clinicians to take the AI’s correct advice (under-trust), or failed to enable clinicians to reject the AI’s errors (over-trust) [21, 32].
Recent research has identified several causes of AI explanations’ repeated failure.
• AI explanations are not always available. Some clinical AI systems (such as deep learning systems for medical imaging) are uninterpretable even to AI researchers, much more so to clinicians.
• Too much information, too little time. In clinical practice, clinicians make dozens, if not hundreds, of decisions in a day, while each AI informs a single decision. It can seem unrealistic to expect clinicians to comprehend all these AIs’ inner workings in addition to caring for patients [26, 34, 61, 70, 75].
• AI explanations are not intuitive or actionable to clinicians. In fast-paced clinical practice, clinicians need actionable information. Actionable information in the medical world typically means that the action has a causal relationship with desired patient outcomes, and ideally this causal relationship has been proven by randomized clinical trials on a large population. In contrast, AI models make personalized, data-driven predictions. The validity and actionability of AI suggestions, therefore, are counter-intuitive to many clinicians. Studies showed that, even when clinicians understood how an AI works, they could not decide whether to take its suggestion. Instead, clinicians requested results from prospective randomized clinical trials that can show the causality between the AI features and patient outcomes. They wanted these results to come from top medical journals, and they wanted the information to be inherently trustworthy, rather than requiring them to validate it in the midst of their clinical decision-making. For most clinical AI models, such evidence does not exist.
Recent research has started improving AI explanation designs to address these challenges. Some worked to make uninterpretable models more interpretable by appending purpose-built explanation-generation models to the DST, or by offering counterfactual predictions that demonstrate how changes to the values of AI features may impact their outputs. Others addressed the "too much information, too little time" problem by presenting only selective or modular explanations to clinicians: the explanations that best justify the AI’s suggestion or that can best address clinicians’ potential biases. These approaches are nascent and have not yet demonstrated an impact on clinicians’ decision quality.
Interestingly, for a few other researchers, the challenges listed above suggest that calibrating clinician-AI trust by explaining the AI is a false hope entirely; clinicians need different kinds of information to validate AI suggestions. These researchers argued that AI explanations’ lack of actionability (one cannot decide whether an AI suggestion is right or wrong, even if one fully understands how the AI generated its suggestion) is a fundamental shortcoming that better explanation designs cannot fix. Instead of improving on the explanation designs, they promoted the idea that designers should take a step back and work to understand clinicians’ information needs in validating AI suggestions, with considerations of their identity, use contexts, social contexts, and environmental cues. This call for action is a key motivator for this research.
2.2 How Clinicians Validate Clinical Suggestions in Practice
To bootstrap our investigations into how clinicians validate care
suggestions in their daily practice, we reviewed the literature on
clinical hypothesis testing and decision-making. We noted three
themes in this work: evidence-based best practices, the interpretive
nature of many clinical decisions, and tool use.
Best practices: Evidence-Based Medicine (EBM). A culture of best practice dominates the medical world; therefore, medical textbooks and best practices can offer some indications of clinicians’ actual practice. Known as Evidence-Based Medicine (EBM), clinical best practices require clinicians to use "clinical evidence" when examining patient-care-related suggestions and making decisions. "Clinical evidence" includes, for example, results of randomized controlled trials, standard care procedures derived from multiple trials, the biological relationships between patients’ genome profiles and their treatment outcomes, and expert opinions on noteworthy case studies. Biomedical literature is central to EBM best practices, because it documents all of this clinical evidence across medical domains. In medical schools, students learn to search for and use evidence in the literature to validate diagnostic or treatment hypotheses [30].
Medical textbooks instruct clinicians to, before adopting a piece of clinical evidence, examine both its rigor (How effectively did it establish the causal effect of the intervention on patient outcome?) and its applicability (Does the study result apply to their particular patient?). In examining rigor, clinicians should prioritize results from large-scale cohort studies over those from published case studies, according to the best practice known as the Level of Evidence Pyramid. In examining applicability, clinicians should identify evidence that fits their patient’s Population, the Intervention under their consideration, the intervention Comparator, and the patient Outcome of their interest, known as the PICO framework.
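To make the applicability check concrete, the following is a minimal Python sketch of comparing a clinical question against a study along the four PICO elements. It is an illustration only, not a tool described in this paper: the `Pico` class, the `matched_elements` function, and the example values are hypothetical, and the case-insensitive exact string comparison stands in for the far richer judgment clinicians actually exercise.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass(frozen=True)
class Pico:
    """One clinical question or study, described by its four PICO elements."""
    population: str
    intervention: str
    comparator: str
    outcome: str


def matched_elements(question: Pico, study: Pico) -> List[str]:
    """Return the PICO elements on which the study matches the question.

    Matching is a case-insensitive exact string comparison, which is a
    deliberate simplification (an assumption of this sketch).
    """
    matched = []
    for field in fields(Pico):
        asked = getattr(question, field.name).strip().lower()
        reported = getattr(study, field.name).strip().lower()
        if asked == reported:
            matched.append(field.name)
    return matched


# Hypothetical, made-up values for illustration only.
question = Pico("adults with condition x", "treatment a", "placebo", "symptom relief")
study = Pico("Adults with condition X", "Treatment A", "usual care", "Symptom relief")
print(matched_elements(question, study))
```

In this sketch a study that differs only in its comparator still matches on three of the four elements, which loosely mirrors how a partially applicable study may remain useful evidence.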
Empirical research showed that clinicians indeed use literature for decision-making in practice. When clinicians disagreed on a diagnosis or treatment, they often used literature as a shared source of truth. The ways in which clinicians reacted to some AI explanations (e.g., requesting evidence of causality, see section 2.1) can seem reflective of these best practices.
Literature-based decision-support tools. When validating patient-care-related hypotheses or suggestions, clinicians most often used literature-based tools. These tools reflect both the EBM best practices and clinicians’ need to rapidly mine seas of literature. For example, clinicians across disease areas use PubMed, a biomedical literature search engine, which allows clinicians to search for clinical evidence based on its level of rigor (e.g., randomized controlled trials, controlled trials, cohort studies). Many also used manually curated literature digest apps such as UpToDate and Journal Club. In addition, some hospitals hire in-house clinical librarians to assist clinicians in the literature search for decision-making [23, 37].
Recently, (semi-)automated literature digest tools have emerged, thanks to the rapid advances in biomedical Natural Language Processing and pre-trained language models such as GPT-3 and BERT. For example, healthcare researchers have started building literature tools that search for clinical trials that match a patient situation based on the PICO framework. Such tools are available across multiple disease areas, such as cardiology and psychiatry, among others.
Social decision support and the interpretive nature of real-world clinical decisions. Notably, the situated and interpretative nature of many clinical decisions requires knowledge beyond what has been proven effective in past patient cases or clinical trials. For example, for patients who arrived in Emergency Rooms while crashing and dying, clinicians often had to make decisions despite incomplete information, let alone experimentally proven evidence. In end-of-life treatment decision-making, the patient’s personal values are also often weighed vis-à-vis scientific evidence of treatment outcome. In these decision contexts, clinicians resisted data-driven decision evidence provided by AI and preferred "social decision support" from colleagues: consulting each other and drawing from others’ tacit knowledge and experience [28, 75].
However, little empirical research has investigated how clinicians validated decision hypotheses (through literature, phone consultation, or other means) and made care decisions as they naturally occur. Instead, most work simply noted whether clinicians did or did not deviate from the EBM best practice. This line of research identified clinicians’ lack of time as one of the most significant barriers to practicing EBM at the point of care [15, 33].
We wanted to explore new DST designs that can help clinicians validate the AI’s diagnostic and treatment suggestions on a case-by-case basis, convincing clinicians to take the AI’s correct advice while rejecting its errors. Prior research has created many thoughtful DST designs by explaining how the AI worked. However, such explanations can seem vastly different from what clinical best practices instruct clinicians to seek when deliberating a care decision. Our work attempts to bring these strands of related work together. We first worked to understand how clinicians validated each other’s care suggestions in current practices (study 1). Building upon this empirical understanding, we then investigated how similar approaches might help clinicians validate AI’s advice (study 2).
Study 1 | Clinical Domain      | Experience | Study 2 | Clinical Domain     | Experience
P1      | Neurology            | 10-30 yrs  | P13     | Internal Medicine   | Over 30 yrs
P2      | Cardiology           | 10-30 yrs  | P14     | Medical Student     | 2-5 yrs
P3      | Pediatric Neurology  | 10-30 yrs  | P15     | Psychiatry          | 2-5 yrs
P4      | Cardiology           | 10-30 yrs  | P16     | Psychiatry          | 1-2 yrs
P5      | Neurogenetics        | 10-30 yrs  | P17     | Medical Informatics | 1-2 yrs
P6      | Nephrology           | 10-30 yrs  | P18     | Pediatric ER        | Over 30 yrs
P7      | Pathology            | 5-10 yrs   | P19     | Pharmacology        | 5-10 yrs
P8      | Pediatric Nephrology | 10-30 yrs  | P20     | Nursing             | 1-2 yrs
P9      | Clinical Librarian   | 10-30 yrs  | P21     | Family Medicine     | 10-30 yrs
P10     | ER                   | 2-5 yrs    |         |                     |
P11     | Clinical Librarian   | 10-30 yrs  |         |                     |
P12     | Clinical Librarian   | 5-10 yrs   |         |                     |
Table 1: Study participants.
3.1 Stage 1: Investigating Natural Interactions
The first study aims to understand how clinicians examined each other’s clinical suggestions in practice. Because biomedical literature is known to play a central role both in clinicians’ everyday decision-making and their rationale for resisting AI suggestions, we chose to focus on how clinicians sought, prioritized, and synthesized information from the literature to validate care suggestions. We paid particular attention to whether and how clinicians’ behaviors deviated from best practices under the time pressure of patient care.
We conducted IRB-approved semi-structured interviews (with components of contextual inquiry) with 9 clinicians and 3 clinical librarians whom these clinicians routinely hired to help with decision-making. We intentionally recruited participants from different clinical roles and specialties, including nephrology physicians (P1, P6), pediatric physicians (P3, P8), cardiology physicians (P2, P4), a pediatric geneticist (P5), an emergency medicine medical intern (P10), and a pathologist (P7). We recruited initial participants from our collaborating hospitals and then expanded the set through snowball sampling. We conducted the interviews remotely. Each interview lasted for about 60 minutes.
In each interview, we started by inviting the participant to recall in detail a recent experience where they needed to examine a hypothesis or suggestion regarding patient care. We asked them to describe the specific patient situation and the broader clinical contexts. If and when they started describing their literature search, we invited them to share their screen and (re-)perform their search using the tools they used in actual practice (e.g., PubMed, Google Scholar). We invited them to think aloud, detailing how they sought, read, interpreted, and synthesized various information (within and beyond the literature) in relation to the specific patient decision and hypothesis in question. We asked follow-up questions to better understand the motivations that drove their information needs and advice-taking or -rejection behaviors.
We screen/video recorded and transcribed all interviews. We analyzed the data using a combination of affinity diagrams, service blueprinting, and axial coding. Through affinity diagramming, we analyzed (1) what information clinicians sought from the literature, (2) how they synthesized the information for the situated and interpretative patient situation at hand, and (3) how they prioritized such information when under time pressure. Further, service blueprinting allowed us to trace these information flows across the patient situation in question, clinicians’ literature search, and clinicians’ multiple decisions over the course of a patient’s care. Next, we performed axial coding to consolidate the perspectives of the clinicians who gave and received clinical suggestions. Finally, we confirmed our findings with four additional clinical professionals (a professor of biomedical informatics, a professor of medicine, and two orthopedic surgeons).
3.2 Stage 2: Imitating Natural Interactions
After understanding how clinicians naturally validated each other’s suggestions, we aimed to examine whether similar interactions can help clinicians validate AI suggestions. Towards this goal, we designed and prototyped a new form of DST that imitates clinicians’ naturalistic interactions observed in study one. We then used the prototype as a Design Probe, conducting IRB-approved interviews with clinicians for feedback and also iteratively improving the design based on their feedback.
We prototyped three versions of the DST design, each focusing on one disease area (neurology, psychiatry, and palliative care). To create a realistic user experience (UX), we populated the prototypes with three retrospective patient cases from top medical journals. We worked with two clinicians in selecting the cases and removed privacy-sensitive information from the clinical narratives. Finally, we populated the DST prototypes with AI diagnosis or treatment suggestions for these patients, using previously published, open-source ML models [68, 69].
To elicit honest feedback on the design, we recruited 9 additional clinicians who had not been interviewed earlier. We used the same recruitment process as in stage one. Study 2 participants came from internal medicine, psychiatry, family medicine, and emergency care, among other domains.
Decision Suggestions under Deliberation | Evidence Embedded in the Literature
Decide what information to inquire from a patient and actions to take during a patient visit | Latest standard care procedures for particular symptoms; etiology
Interpret lab test results and make diagnosis | The sensitivity and effectiveness of the lab test; differential diagnosis
Make prognosis and treatment decisions | Different treatment options’ anticipated outcomes based on outcomes of past patients similar to the patient situation in question (in terms of, for example,
Table 2: Clinicians share various literature evidence when they suggest different clinical actions to colleagues. The left column lists the clinical actions. The right column lists the information in the literature clinicians considered as evidence for that action.
In each interview, we first described the need for clinicians to deliberate whether to accept an AI-generated care suggestion. Without showing our prototype, we probed the participants’ reactions to the idea of validating AI suggestions with literature. These reactions reflect clinicians’ opinions on the idea before our design could bias them. Next, we showed the participants a prototype that matched their medical expertise. We invited the participants to think aloud while they read the patient information, DST suggestions, and literature evidence. We asked follow-up questions to investigate how the literature evidence helped (or failed to help) them deliberate whether to accept the AI suggestions. Finally, we collected additional feedback by inviting clinicians to improve the prototype. We audio- and video-recorded the interviews and analyzed the data using the same methods as Stage 1 (see 3.1).
In study one, our interviews confirmed that clinicians use biomedical literature to validate colleagues’ care suggestions and to answer their “tough questions”, especially around rare patient cases. The literature is clinicians’ shared source of truth. While all clinicians we interviewed said that they were very, very busy and did not have time to read, almost all had looked up literature within the past two weeks, some even within the past few days.
Below we describe findings around the kinds of information clinicians used to deliberate care suggestions: (1) Clinicians provide evidence that can validate (or invalidate) their suggested diagnosis or treatment, rather than explanations of the suggestion; (2) Clinicians consider as “perfect evidence” the evidence that fits the patient situation at hand “perfectly”. Such evidence rarely exists; (3) When the perfect evidence does not exist, clinicians curate and synthesize imperfect evidence to make decisions; and (4) Clinicians utilize time-saving strategies to perform efficient reading, curation, and synthesis.
4.1 Don’t Explain. Provide Evidence Instead.
When exchanging clinical suggestions, clinicians rarely explained
how they came up with the suggestions (e.g., a suggested diagnosis,
a lab test order, or a medication order). Instead, they shared a list
of evidence from biomedical literature. Table 2 summarizes the
clinical suggestions clinicians have referenced literature for and the
information clinicians considered as evidence of the suggestions.
All clinicians we interviewed made an explicit effort to seek and share a list of evidence that is scientific. The resulting list is scientific in two ways. Firstly, it is comprehensive. Clinicians worked to ensure that they provided all supporting and opposing evidence of their suggestion in the literature. For example, when suggesting a treatment via email, one clinician shared the entire (literature) search results to demonstrate that the evidence list was unbiased by ranking algorithms (P9). Others highlighted the latest literature on the list, signaling that they included all the latest data and treatment options (P1, P2). Secondly, the evidence list is reproducible. Clinicians often shared the link they used for their evidence search so that their colleagues could get the same results each time when examining the care suggestion and its list of evidence.
The comprehensive and reproducible evidence list served dual purposes in advice-giving-and-taking among clinicians. On the one hand, the list is persuasive. It demonstrated that the best evidence available supports the clinician’s suggestion. On the other hand, the list allowed the clinician’s colleagues to scrutinize the supporting and opposing evidence first-hand and draw their own conclusions.
4.2 The Myth of the Perfect Evidence
We use the term perfect evidence to refer to the kind of information
that can immediately trigger clinicians’ acceptance or rejection of
a piece of clinical advice. In alignment with medical textbooks,
clinicians characterized the perfect evidence as both rigorously
generated (robustness) and highly applicable to the patient situation
they face (applicability). However, interestingly, their criteria for
evidence applicability far exceed what textbooks require.
When validating the rigor of a piece of literature evidence, clinicians used several proxies, such as the size of the patient population it has been tested on, publication venues, and authors. Some also checked the disclaimer part of the literature, assessing whether the clinical evidence on branded treatments was published for profit rather than in the patient’s best interest (P2). Overall, the ways in which clinicians validated evidence rigor align with clinical best practices, though much faster. “It’s the reviewers’ job, not mine, to invest in the time to see if the study is legit.” (P8)
In validating the applicability of a piece of literature evidence, however, clinicians spent ample time and much cognitive energy. When reading clinical trial reports and case studies, almost all clinicians we interviewed jumped right to the first table in the article, which describes the demographic information of the patient or patient population. They then worked on mapping the patient population onto their patients’ situation. “[I] looked at what happened to these patients under what particular circumstances in the past, and sort of deciding whether they could apply to my patient.” (P6) Some did this mentally. Others, especially the clinical librarians who routinely handle unusually complex patient cases, used the PICO (Population, Intervention, Comparator, Outcome) framework to match patient situations explicitly.
A pursuit of literature evidence that matches their patient’s situation
perfectly drives clinicians’ search. Their definition of perfect
applicability far exceeds what best practices demand (e.g., PICO
match.) To all clinicians we interviewed, perfection means that
the patient population in the clinical trial matches not only their
patient’s demographics but “all characteristics of the patient” (P4),
including age, sex, weight, presence of comorbidities, past disease
trajectories, previous interventions, and even family history. P8,
a pediatric nephrologist, provided a particularly vivid example.
When treating a preschool-aged female patient with Liddle syn-
drome and a family history of aldosterone-mediated hypertension,
P8 suspected that the patient might have monogenic hypertension
(a type of genetic high blood pressure.) He had run preliminary
laboratory tests and looked up literature to decide what other tests
he should order to confirm the diagnosis. He was looking for a
testing procedure that has been proven effective on patients of the
same gender and age, with the same disease variation (Liddle syn-
drome and monogenic hypertension), family history, and genetic
inheritance pattern (e.g., the disease was similarly symptomatic
in some family members). The evaluation of the suggested testing
procedure had to fit within those parameters to convince him to
carry out the suggestion immediately.
The perfect evidence does not exist for a vast majority of pa-
tient cases, especially for treatment suggestions. Combining the
various criteria clinicians outlined above, the perfect evidence for
the treatment of a patient is the treatment evaluation result from a
large-scale, well-conducted randomized controlled trial, in which
the trial participants are similar to the patient at hand in all charac-
teristics. This is an exceptionally high bar, if at all achievable. Fur-
ther, clinicians only seek help on unusual patient situations or rare
diseases, making the perfect clinician-evidence-patient-situation
match even less likely.
4.3 Making Decisions with Imperfect Evidence
Clinicians acknowledged that perfect evidence that can immediately
trigger action is rare. After confirming that the perfect evidence
does not exist for their patient, clinicians collected pieces of imperfect
evidence: evidence that is “useful but non-conclusive” (P6). “You
want to look for something else also.” (P10)
Clinicians undertook a three-step process to identify and syn-
thesize the list of imperfect evidence to conclude a care decision.
First, clinicians collected a set of evidence from less heavy hitting
journals. For example, they looked for insights from cohort studies,
case-control studies, and case reports (rare patient cases that
other clinicians encountered and published in peer-reviewed journals)
(P4, P6, P10). When necessary, clinicians even referenced what
they described as “obscure” (P3) or “gray literature” (P12), such as
posters and conference meeting abstracts.
Next, if even less prestigious literature failed to reveal perfectly
applicable evidence, clinicians relaxed their evidence applicability
criteria. Using the PICO framework as a scaffold, clinicians first
looked for studies or historical cases whose patient population and
study setting overlap most with the patient situation they face.
They referred to such evidence as "second-best matches". They then
relaxed the search criteria further (for example, by including studies
that cover a slightly different age group or a different subcategory
of the same disease) until they had exhausted applicable evidence.
Finally, clinicians worked to identify a trend or narrative that
can connect the pieces of evidence they collected from diverse
sources with different levels of applicability. They again used the
PICO framework as a scaffold. For example, to conclude a treatment
decision, clinicians first guesstimated the likely treatment outcome
of a patient population that shares some characteristics with
their patient. They weighed competing evidence about the same
population differently according to its robustness. Next, clinicians
repeated this synthesis process on a patient population that shares
another set of characteristics with their patient. This iterative pro-
cess continues until they see a trend in the likely treatment outcome
when these various patient characteristics meet. As P6 describes,
“It’s essentially looking at that and sort of deciding what happened to
those patient [...] if they could apply to my patient.” Clinicians took
action on their patient according to this evidence trend.
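The search-and-relax pattern described above can be illustrated with a minimal sketch. It is not the clinicians' actual procedure nor any tool used in this study: the `Study` record, the characteristic names, the toy data, and the assumption that criteria are relaxed from least to most important are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Study:
    """A hypothetical, simplified record of one piece of literature evidence."""
    title: str
    design: str       # e.g. "rct", "cohort", "case_report"
    population: dict  # patient characteristics the study covers


def matches(study, patient, criteria):
    """True if the study population agrees with the patient on every criterion."""
    return all(study.population.get(c) == patient.get(c) for c in criteria)


def relaxed_search(studies, patient, criteria):
    """Drop applicability criteria one at a time until some evidence matches.

    `criteria` is ordered from most to least important (an assumption of this
    sketch); the least important remaining criterion is relaxed first.
    Returns (criteria still enforced, matching studies).
    """
    active = list(criteria)
    while active:
        hits = [s for s in studies if matches(s, patient, active)]
        if hits:
            return active, hits
        active.pop()  # relax the least important remaining criterion
    return [], list(studies)  # nothing matched on any criterion


# Made-up example data, for illustration only.
patient = {"disease": "D1", "sex": "F", "age_group": "preschool"}
studies = [
    Study("Trial A", "rct", {"disease": "D1", "sex": "F", "age_group": "adult"}),
    Study("Case B", "case_report", {"disease": "D1", "sex": "M", "age_group": "preschool"}),
]
enforced, hits = relaxed_search(studies, patient, ["disease", "sex", "age_group"])
print(enforced, [s.title for s in hits])
```

In this toy run no study matches all three characteristics, so the sketch drops the age-group criterion and returns the trial that still matches on disease and sex, a "second-best match" in the clinicians' terms.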
It is not a coincidence that, when clinicians failed to find the
perfect evidence, all of them relaxed the robustness criteria to look
for the second-best patient-situation matches. When synthesizing
evidence, clinicians used the PICO patient-situation match as a scaf-
fold. Overall, the clinicians prioritized applicability over robustness
when seeking and synthesizing evidence. They considered a highly
applicable case study more informative than a well-conducted, large-scale
randomized trial on a different patient population. P3’s recollection
of a recent experience underscores the lengths clinicians
would go to in order to harness a piece of less-robust but highly
applicable information. When exploring treatments for a rare disease
in a patient, P3, a neurologist, came across an ongoing clinical
trial that fits the patient’s demographics and history. He went on to
contact a friend who works at the same hospital as the author (the
clinician who was leading the trial), got introduced to the author,
sought their tips and tricks on “how we are going to be successful with
this [disease management]”, went on to contact the drug company,
and finally got the drug under trial for the patient. Clinicians value
the evidence that fits their patient’s situation well, in this case, even
when the evidence is still under development.
With Dr. [name], I wanted to know if he thought the
drug would work. And then, if it did, how do I get it?
How do I use it? What do I need to do? What resources
do I need? How we are going to be successful with this?
4.4 Time-Saving Strategies
The interview findings so far have shown that, when deciding
whether to take a clinical suggestion, clinicians set an exceptionally
high standard for decision evidence and thoughtfully curated and
synthesized a comprehensive list of imperfect evidence when this
standard is too high to reach. Noteworthily, clinicians also carried
out a set of time-saving strategies to make this process as efficient
as possible.
Reading nothing but the evidence, its robustness, and its ap-
plicability. During the contextual inquiries, no clinicians read a
clinical trial or case study entirely. Instead, they skipped to the pop-
ulation description to assess the study’s applicability, then skipped
to the Method section to validate its robustness. If the study met
both criteria, clinicians skipped to results to find out whether the
result was positive or not (P6). Following this process, it takes
clinicians less than one minute to extract the evidence they need
from an article of dozens of pages.
The only thing that matters to me is that: No survival
benet with the addition of ADT (a medication). I won’t
read into the p-values and the percentages. All I care
about is whether [outcome] is positive. (P10)
Clinicians also deemed literature with neutral outcomes as ir-
relevant because the information is not actionable. They stopped
reading as soon as they realized the study outcome was neutral.
Synthesizing evidence (only) until it can justify action. While
clinicians’ literature synthesis for decision-making was compre-
hensive and systematic, the synthesis only went as far as it could
justify a clinical action. For example, if the suggested clinical action
is to order an additional lab test to confirm a potential diagnosis,
clinicians only need a relatively weak set of evidence to take the
suggestion because the cost of running the extra test is low, while
the cost of missing a diagnosis is high. In this sense, clinicians in-
form patient decision-making with literature evidence much faster
than a literature review process in research settings.
Reusing literature evidence over the course of a patient’s care. Because
clinicians sought literature evidence that matched the patient
situation at hand, the same set of literature can inform multiple clinical
decisions over the course of the patient’s care (Table 2). When
first meeting the patients, clinicians accumulated “landmark papers”
that can offer a rough guideline for care decisions but do not entirely
match the patient’s situation (P1, P6). As clinicians gathered
more understanding about the patient situation (e.g., via diagnostic
tests), they gradually turned to more precisely relevant clinical
trials and even historic patient cases. The pieces of evidence they
collected along the way progress from broad-stroke to case-specific,
and all can potentially inform subsequent care decisions.
We have so far described how clinicians examined each other’s
suggestions in current practice. Now we shift to consider how these
naturalistic interactions inform how DSTs can help clinicians more
eectively examine AI suggestions.
Recall that existing DSTs calibrated clinician-AI trust primarily
by explaining to clinicians how the AI has generated its suggestions.
Interestingly, clinicians used a very different type of information
when deliberating on each other’s clinical suggestions: biomedical
literature. Specifically, we highlight four characteristics of literature
information that made it effective.
Providing evidence, rather than explanations. Clinicians
deliberated on colleagues’ care suggestions, not by how the
colleague came up with the suggestion, but based on external
evidence concluded from entirely different processes (e.g., clinical
trials). Using this information, clinicians could stay focused
on analyzing “whether this is the right diagnosis or treatment
for this patient” rather than on whether the suggestion is
trustworthy by itself.
This providing-supporting/opposing-evidence approach can po-
tentially address several challenges AI explanations face in cali-
brating clinician trust in AI suggestions. First, external evidence
that can support or oppose a suggested diagnosis/treatment is
always available, regardless of whether the AI models are in-
terpretable or not. Second, it can save clinicians’ time because
clinicians are likely to be much more efficient in analyzing
“whether this is the right diagnosis or treatment for this patient”
than scrutinizing AI’s inner workings. Finally, it can keep clini-
cians’ cognitive work on clinical decision-making rather than
AI-explanation sense-making.
Providing both evidence of robustness and applicability,
respectively. Prior research has shown that clinicians did not
consider explanations of how an AI generated its suggestions
as actionable. The explanations often failed to sway clinicians
to accept or reject the AI’s suggestion [ ]. But why? Our
findings revealed that a combination of two types of information
could trigger clinicians’ advice-taking behavior: evidence of the
suggestion’s robustness (also known as “internal validity” [ ])
and of its applicability to the patient situation at hand (“external
validity”). However, the perfect evidence that indicates both
rarely exists. Therefore, clinicians assessed these two types
of evidence separately: A study’s population size indicates its
robustness. Its patient population match indicates applicability.
Existing DST designs did not always make this distinction. In-
stead, many DSTs unknowingly mixed evidence of its AI sug-
gestion’s robustness (e.g., evidence that the model is high per-
forming on the patient population it is originally trained on)
with evidence of its applicability (e.g., evidence that the model
is high performing on a patient population that perfectly re-
sembles the patient in question.) Further, evaluating a model’s
performance on patients like this patient in all characteristics is
rare and challenging. In this light, it can seem unsurprising that
AI performance indicators also struggled to persuade clinicians
to accept or reject AI suggestions.
Providing comprehensive rather than selective evidence. A
critical challenge in DST design is to calibrate clinicians’ trust
in AI under the time pressure of clinical practice. Some DST
designs addressed this challenge by providing clinicians with
selected information about the AI, such as a subset of features
that can best justify the AI suggestion [ ]. Our interview
findings suggest an alternative perspective to this approach. To
clinicians, the trustworthiness of a clinical suggestion comes
from the fact that the entirety of its supporting evidence, in
aggregate, outweighs the entirety of its opposing evidence. Clinicians
saved time by reading only the critical takeaway from
every piece of evidence rather than by reading only selected pieces
of evidence. DSTs might play a more critical role in calibrating
clinician trust if they offered a comprehensive but succinct set of
evidence and counter-evidence.
Figure 1: An initial prototype of the new DST design. This design helps clinicians deliberate whether to take an AI’s suggestion
by providing a list of supporting and opposing evidence from biomedical literature. The top panel highlights the patient’s
basic information (e.g., demographics). The columns on the left overview the patient’s medical history and test results (in this
example, psychiatric evaluation results). AI’s treatment suggestion, along with its supporting and opposing evidence from
biomedical literature, are on the right.
Drawing the evidence from a shared source of truth. When
a clinician shared a suggestion with a colleague, they used
biomedical literature because it is a shared source of truth for
both the advice-giver and the advice-taker. As a result, the
clinicians did not need to elaborate on the trustworthiness of
the literature evidence itself; they were both aware, for example,
that randomized trials are more robust than historical case
studies. Instead, they used the literature as a shared language
and both focused on assessing the evidence’s applicability to
the specific patient situation at hand.
The same cannot be said about AI explanations. Explanations
around AI’s inner workings are the language of the AI (the
advice-giver) rather than that of the clinicians (the advice-taker.)
The struggles clinicians had in interpreting AI explanations
can be seen as a symptom of this lacking-a-shared-source-of-
truth problem (for example, they wanted evidence of causality
between AI’s features and patient outputs [74, 75]).
5.1 Design Goal and Strategies
We wanted to design a new form of DST that can help clinicians
more eectively examine AI suggestions, such that they take only
the correct suggestions and reject the AI errors. Embracing clini-
cians’ naturalistic interactions, we identied the following design
Drawing the evidence from information sources that clinicians
already understand and trust.
Providing evidence for/against the AI’s suggestion, rather than
explaining the AI’s inner workings.
Providing evidence of the AI suggestion’s rigor and applicability
Providing a comprehensive set of evidence and counter-evidence,
but present each piece of evidence concisely.
5.2 The Design
Next, we translated the design strategies into a concrete interaction
design and an initial prototype. Given the focus of this research, we
focused on designing the part of the DST interface that calibrates
clinicians’ trust in AI suggestions (Figure 1, red arrow.)
Drawing the evidence from information sources that clini-
cians already understand and trust. We chose to use biomed-
ical literature as the source of AI explanation information
because clinicians across disease areas trust it as a source of
decision evidence.
Providing evidence for/against the AI’s suggestion, rather
than explaining the AI’s inner workings. Using the PICO frame-
work, the new DST design retrieves biomedical articles that
match the patient’s situation (Population) and the clinical action
that the AI suggests taking (Intervention, intervention
Comparator, and Outcome of interest).
Providing evidence of the AI suggestion’s rigor and appli-
cability respectively. We included various types of literature,
from randomized clinical trial reports (rigorous but imprecise
patient population match) to historical case studies (potentially
precise patient situation matches, but less rigorous.)
Providing a comprehensive set of evidence but presenting each
piece of evidence concisely. While presenting a comprehensive
list of relevant articles, we summarized each in just a single
sentence. Although drawn from study 1 observations, this design
choice echoes existing counterfactual-based [ ] and
modular AI explanation designs [ ]. Noteworthily, our design
differs in that its evidence comes from experimentally-proven
facts rather than probabilistic AI predictions.
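The four design strategies above can be made concrete with a small data-structure sketch. This is an illustration under stated assumptions, not the prototype's implementation: the `Evidence` record, the `RIGOR` ranking by study design, and the use of PICO-element overlap as an applicability score are all hypothetical simplifications.

```python
from dataclasses import dataclass

# Hypothetical rigor ranking by study design (higher = more rigorous).
RIGOR = {"rct": 3, "cohort": 2, "case_report": 1}


@dataclass
class Evidence:
    summary: str        # one-sentence takeaway shown to the clinician
    design: str         # key into RIGOR
    stance: str         # "supports" or "opposes" the AI suggestion
    pico_overlap: dict  # PICO element -> True if it matches the patient/suggestion


def applicability(ev):
    """Fraction of PICO elements that match (a stand-in applicability score)."""
    return sum(ev.pico_overlap.values()) / len(ev.pico_overlap)


def evidence_panel(items):
    """Group evidence by stance, keeping rigor and applicability as separate fields.

    Items are listed most-applicable first, mirroring the observation that
    clinicians prioritized applicability over robustness.
    """
    panel = {"supports": [], "opposes": []}
    for ev in sorted(items, key=applicability, reverse=True):
        panel[ev.stance].append({
            "summary": ev.summary,
            "rigor": RIGOR[ev.design],
            "applicability": round(applicability(ev), 2),
        })
    return panel


# Made-up example data, for illustration only.
items = [
    Evidence("Drug X improved outcomes in adults.", "rct", "supports",
             {"P": False, "I": True, "C": True, "O": True}),
    Evidence("Drug X was ineffective in one similar patient.", "case_report", "opposes",
             {"P": True, "I": True, "C": True, "O": True}),
]
panel = evidence_panel(items)
print(panel)
```

Keeping `rigor` and `applicability` as two separate fields, rather than collapsing them into one score, reflects the finding that clinicians assessed the two kinds of evidence separately.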
The design of the remainder of the DST interface simply followed
the lessons from prior research [ ]. The resulting DST interface
has three parts: Occupying most of the interface is the patient’s
demographic information, medical history, lab test results, and other
EHR information. On the right side is AI’s personalized diagnostics
or treatment suggestion. Because we wanted to probe clinicians’
feedback towards our design strategy “providing evidence rather
than AI explanation”, the prototype did not include any explanation
of AI performance indicators or inner workings. This allowed us to
observe whether they requested to use this information.
We populated the prototype with literature information using a
combination of multiple functioning machine learning (ML) mod-
els and Wizard of Oz (WoZ, only for prediction quality control).
First, we searched for relevant articles using a purpose-built search
engine that is built upon EBM-NLP, a PICO-based biomedical literature
dataset [ ], and clinical bioBERT [ ]. We summarized each
literature article by leveraging the extractive summarization function
of GPT-3 (davinci-001), a pre-trained large language model [ ].
Finally, we as “wizards” manually corrected the few errors in these
ML-curated results, because the focus of this study is not to evaluate
these ML models’ document retrieval or summarization abilities.
Instead, our goal is to probe whether our design goals and design
choices are promising directions as such language models become
increasingly capable.
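The shape of this search, summarize, and manually correct workflow can be sketched as follows. The sketch makes no claim about the actual search engine, GPT-3 calls, or correction procedure used in the study: `curate_evidence` and its three callables are hypothetical, and the stand-ins below (substring search, first-sentence summary, a lookup table of corrections) are deliberately trivial placeholders for those components.

```python
from typing import Callable, Iterable, List


def curate_evidence(
    query: dict,
    search: Callable[[dict], Iterable[str]],
    summarize: Callable[[str], str],
    review: Callable[[str, str], str],
) -> List[str]:
    """Retrieve articles, summarize each, then let a human reviewer correct it."""
    results = []
    for article in search(query):
        draft = summarize(article)
        # The reviewer (the "wizard") may return the draft unchanged.
        results.append(review(article, draft))
    return results


# Placeholder components with made-up data, for illustration only.
corpus = [
    "Drug X reduced symptoms. Details follow.",
    "Drug Y had no effect. More text.",
]
corrections = {}  # article text -> manually corrected summary


def search(query):
    return [a for a in corpus if query["intervention"] in a]


def summarize(text):
    return text.split(". ")[0] + "."


def review(article, draft):
    return corrections.get(article, draft)


summaries = curate_evidence({"intervention": "Drug X"}, search, summarize, review)
print(summaries)
```

Passing the three stages in as functions keeps the human correction step explicit and lets any one stage be swapped out without touching the others.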
In study two, we interviewed an additional 9 clinicians for their
feedback on the new DST design and collaboratively improved the
prototype. Our findings confirmed that the design strategies drawn
from clinicians’ naturalistic suggestion-validation interactions can
help calibrate their trust in AI suggestions. Moreover, clinicians’ in-
teractions with the prototype also revealed new design and research
opportunities around (1) harnessing the complementary strengths
of literature-based and predictive decision supports; (2) mitigating
risks of de-skilling clinicians; and (3) offering low-data decision
support with literature.
6.1 Confirming the Design Strategies
Literature evidence is a valuable complement to “big data”. All
clinicians we interviewed welcomed the shift of focus from explain-
ing AI’s inner workings to providing clinical evidence of the AI’s
suggestions. While the prototype included no explanations of the
AI or its performance indicators, only one clinician requested some
explanation (She requested to see which hospitals’ patient data the
AI suggestions are trained on.) In contrast, clinicians described the
causality-focused literature evidence as “a valuable complement to
the correlation-driven big data predictions”. They analogized AI
predictions to “aggregating many, many experiences of many doctors”
while describing literature evidence as embodying the mantra of
Evidence-Based Medicine. The former resembles clinicians’ tacit
practical experience, while the latter offers proven knowledge and
“what most doctors decided was a good decision.” Both cover critical
aspects of clinical decision-making.
There are different kinds of information that can inform
your decision. [...] There’s evidence-based medicine
(EBM). [...] But experience is also very important. Mind
you medicine (practice) is not really like mathematics.
It is not an accurate science. It always has surprises and
the common knowledge that people have from other
doctors. (P21, family medicine physician)
Clinicians working in interdisciplinary medical domains (e.g.,
emergency care, family medicine, internal medicine) particularly
appreciated the literature evidence. These domains cover “a very,
very large area of knowledge”. Literature evidence can help them
examine a broader range of decision hypotheses and AI suggestions.
Clinicians highly valued the evidence of AI suggestion applicability.
Echoing findings from study one, the prototyping study
underscored the critical importance of AI suggestions’ applicability.
Recall that our initial prototype offered a brief summary of each
biomedical article’s findings. The clinicians we interviewed instead
demanded an explicit description of the patient situation the article
focuses on, even if it means more text to read (see Figure 2 for
the revised design). In order to more easily examine the literature
evidence’s applicability, some requested us to color code the different
characteristics of the article’s population of focus and to
highlight the corresponding characteristics in the current patient’s
EHR information (P14, P15, P17). “How the medicine fits into the
patient’s narratives is important.” (P17)
I need to figure out how to treat this really complex
disease . . . it’s the treatment for one can kill the patient
with the other. (P15, Emergency Room resident)
Clinicians demanded a concise summary of a comprehen-
sive set of evidence. All clinicians we interviewed expected the
literature evidence to be comprehensive. They considered compre-
hensiveness to be a key strength of literature. For example, while
seeing “big data” as capturing patient outcomes of the lab tests
and treatments that have long been available, clinicians considered
literature as capturing the latest etiology, tests, and treatment op-
tions that can update their knowledge. The comprehensiveness of
literature evidence is particularly valuable to clinicians when they
discuss patient cases with colleagues and mentees.
(I want to use this tool) when I’m presenting this [pa-
tient] case on the rounds to somebody, just put my
thoughts together and make myself sound more on top
of things. There is an element of showmanship in rounds,
where you are trying to show how smart you are, that
you’ve thought everything, that you got everything under
control which you don’t. (P13, Emergency Room physician)
Figure 2: The revised DST design resulted from a co-prototyping study with 9 clinicians. The clinicians
found the literature evidence useful in complementing the data-driven DST predictions.
Clinicians wanted the most concise summary possible while covering
the literature evidence comprehensively. Ideally, it can “just
give me some kind of alert”, an alert when the literature indicates that
what the AI suggests, or what they are about to do, is wrong. The Emergency
Room clinicians we interviewed all requested the literature
evidence to be summarized into such an alert, because otherwise
“You know, if it’s 2 o’clock in the morning and I’m seeing a patient with
phenomena, I’m not going to look at this.” Other clinicians expressed
more appreciation of the detailed literature evidence, as they may
study it before meeting patients or after work (e.g., “when you sit
at home at night, thinking through this patient.”) In the meantime,
they also wished for literature evidence in a condensed form such
that they could use it when “sitting in front of patients and needing
to decide something quickly”.
Clinicians trust biomedical literature for debiasing AI suggestions.
Our DST prototype intends to calibrate clinician-AI trust
in diagnostic and treatment decision-making with literature evidence.
Interestingly, clinicians appreciated the literature not only
in these decision contexts, but also in debiasing the AIs’ and their
own decision-making. This sentiment confirmed that to clinicians,
literature is often a more trusted source of information than explana-
tions of AI’s inner workings. It also suggests that DSTs can leverage
literature for interactions beyond validating AI’s diagnostic and
treatment suggestions.
Literature can point out biases doctors made (in the AI
training data), like black patients didn’t get (cardiac)
catheterization as frequently as white patients did, and
women didn’t get evaluated for coronary diseases as
often as men did. Women present differently with coronary
symptoms; they don’t describe [the symptoms]
using the same words that were used in the textbooks,
because the textbooks were based upon male patients.
[The literature] It’s very thorough. (P13, Emergency
Room physician)
6.2 Revealing New Design Implications
The preceding observations confirmed that the types of information
clinicians used to examine each other’s suggestions (external
evidence of the suggestions’ rigor and applicability) could also help
them examine AI suggestions. Noteworthily, biomedical literature
and our prototype design represented only one source of
such external evidence and one way of presenting it. Clinicians’
suggestions for improving the prototype revealed additional oppor-
tunities in this newly opened design space.
Harnessing additional sources of clinical evidence: medical im-
ages and -omics data. The clinicians we interviewed requested
forms of clinical evidence beyond text-based literature summaries.
For example, multiple physicians requested image-based literature
evidence to help validate computer vision suggestions and make
image-based care decisions. The gold standard reference images
in pathology and dermatology literature can be valuable in these
decision contexts. Similarly, pharmacists expressed a desire for literature
evidence that matches patients’ “-omics data” (patients’
genetic or molecular profiles, which indicate their likely disease
symptoms and medication responses).
Harnessing literature evidence as stand-alone decision support
for low-resource hospitals. Multiple clinicians suggested that the
literature evidence alone can be valuable for low-resource hospitals,
such as rural health clinics and veterans’ hospitals. While some of
these hospitals have only recently moved to EHR and may not have
deployed AI-based DSTs, the literature evidence can nonetheless
benefit their workforce, who are likely to be less experienced and
to face a more significant shortage of physicians.
6.3 De-skilling Clinicians? A Risk of Calibrating
Clinician-AI Trust with Literature Evidence
Among the nine clinicians we interviewed, two (P13, P15) suggested
a concern over clinicians becoming over-reliant on clinical litera-
ture. While other clinicians characterized the literature evidence
as a trigger for self-reflection and slow-thinking, these clinicians
analogized literature evidence to medical textbooks or even step-
by-step instructions.
P15 describing her ideal form of literature evidence:
It’s just kind of like someone made bullet points for
other providers saying. Hey, this is what we do.
In this context, they pushed back on seeing literature evidence
on the DST, as it might over time undermine clinicians’ ability to
think on their feet at the point of care. “A good physician should be
able to practice medicine in a power failure,” P13 stated.
Clinical decision support tools (DSTs), powered by Artificial Intelligence
(AI), promise to improve clinicians’ diagnosis and treatment
decision-making process. However, no AI model is always correct.
DSTs must enable clinicians to validate AI suggestions on a case-by-
case basis, convincing them to take AI’s correct suggestions while
rejecting its errors. Prior DST designs often explained AI’s inner
workings or performance indicators. This paper provided an alter-
native perspective to this approach. Drawing from how clinicians
validated each other’s suggestions in practice, we demonstrated
that DSTs might become more effective in calibrating clinicians’
trust in AI suggestions if they provided evidence of the suggestions’
robustness and applicability to the specic patient situation at hand.
Such evidence should be comprehensive but concise and come from
a shared source of truth with clinicians.
This approach offers a timely answer to HCI communities’ call
for a use-context-based approach to designing explainable AI [ ].
Below, we first discuss how our findings point to new design opportunities
around clinician-AI trust calibration and XAI. We then take
a step back and more critically reflect on the role literature-based
AI and patient-history-based AI can and should play respectively
in supporting clinician decision making.
7.1 Designing for Clinician-AI Trust Calibration
Like many others in the HCI community [ ], we argue that
explaining how AI works cannot fully calibrate clinicians’ trust in
AI’s individual suggestions, even though it is indispensable in many
other contexts (e.g., AI accountability, regulation). To help draw this
distinction, we promote the terminology shift away from “XAI” as a
catch-all term, to the concept of “trust-calibration interaction design”.
In doing so, we hope to open up a new design space that explores a
wide range of trust calibration information and interactions that go
beyond explaining AI’s inner workings or performance indicators.
The literature-based prototype presented in this paper provides one
example. Future DST design practitioners should further explore
this rich design space.
To jump start this exploration, our empirical findings identified
three characteristics of biomedical literature that have made
it effective in calibrating clinicians’ trust in AI. First, it contains
information that is inherently trustworthy to clinicians, an essential
requirement of clinical decision-support information [ ]. Second,
literature evidence is a shared source of truth among clinicians of all
medical domains, therefore useful for real-world clinical decision
making which is often highly collaborative and interdisciplinary.
Third, unlike explanations of AI’s inner-workings, the literature
contains various information that collectively can support all types
of clinical decisions (e.g., etiology, diagnosis, treatment, medical
imaging interpretation). When caring for each patient, clinicians
make a series of interconnected decisions rather than discrete ones
(e.g., identifying the cause of a symptom overlaps with concluding
a diagnosis) [ ]. Drawing from the same source of evidence across
these decisions reduces clinicians’ cognitive load and saves time.
Future research should explore new information sources that
share these characteristics (e.g., biomedical literature, medical im-
age references, genomics data) to address AI’s potential errors and
biases. Recent AI research that cross-checks EHR-based AI patient
outcome predictions against the biological relations between treatment
and outcome [ ] and against clinician-authored Knowledge
Graph [4] can be seen as great examples in this direction.
At a higher level, we advocate for an infrastructural approach
to calibrate clinicians’ trust in clinical AI systems. Even though
each EHR-based AI most often supports one clinician in making
one decision at a time, AI trust-calibration design should consider the
broader context in which clinicians collaborate constantly and make
interconnected decisions. In this context, it is more effective for DSTs
to draw from a consistent source of information that can validate
diverse AI suggestions (regardless of whether they come from a computer
vision model or a simple Bayesian model). DSTs can then adapt the information
presentation and interaction design to each decision, its particular
human context, and the AI system involved. Biomedical literature
and genomics data have started becoming part of the healthcare
information infrastructure. Can DSTs harness this infrastructure
and support clinicians’ trust calibration in AI? What other information
sources can become such a rising tide that raises all (AI) boats?
These are exciting questions for future work to explore.
7.2 Blending Diverse Sources of Intelligence To
Support Clinical Decision-Making
We have discussed the design and research opportunities that
harnessing literature as a source of trust-calibration information has
revealed. Parallel to these efforts should be work that critically
reflects on the limitations we have observed of literature (or of
information that shares some of its characteristics). We discuss these
limitations against the two axes of difference that have scaffolded
the debates around what constitutes “the best evidence for clinical
decision-making”. These axes offer a useful structure for future
research to investigate: How can DSTs best harness the
complementary strengths of literature-based and EHR-based decision
supports while ameliorating their weaknesses?
Evidence rigor and evidence applicability. The clinical world
famously values causality: every clinical action ideally has a causal
relationship with improved patient outcome, and the causality ide-
ally has been proven by randomized clinical trials on a large, diverse
patient population [
]. Clinicians’ focus on causality and large-
scale evaluation is often described as causing their resistance to
AI’s highly-personalized decision suggestions [25, 34, 74, 75].
Our study 1 findings suggest that there is more to the story:
Clinicians value the external validity of decision evidence (Does
the information apply to this particular patient?) as much as its
internal validity (How rigorously was it generated? Was it validated
in a large patient population?). The perfect evidence is one that has
been proven in a large patient population that resembles their
patient’s situation 100%. Such evidence simply does not exist. It
is in this context that we do not see the success of our design as
an indication that literature evidence should replace explanations
of AI’s inner workings. Instead, clinicians appreciated our design
because it paired AI’s personalized suggestion (high applicability)
with causality-based literature evidence (high internal rigor). We
see an opportunity for future research to reflect on and further
improve the way our prototype coordinated AI explanations and
literature evidence, so that they can together form the perfect
evidence for clinicians’ decisions.
Explicit knowledge and tacit knowledge. Evidence, whether it
is from past patients’ data or clinical literature, is not omnipotent.
The situated and interpretative nature of many clinical decisions
requires knowledge beyond what’s proven effective in past patient
cases or clinical trials [
]. But in what circumstances should one
trust clinicians’ “experience”, or tacit knowledge, and when should
one instead force them to think slow and consider both the supporting
evidence and the counter-evidence for their judgments? This is a question
of long-standing debate in medicine and in AI decision support
design [75].
This question also underlies the limitations of literature-based
evidence. On the one hand, how can literature- and EHR-based AI
systems enhance clinicians’ trained intuitions, rather than distract
clinicians from them or even de-skill them? The emergency room physicians
we interviewed wished our literature evidence system took the form
of an alert button, “Literature evidence in aggregate disagree with your
judgment”, pointing to an interesting direction for exploration. On
the other hand, how can literature-based AI better serve decisions
where clinicians’ intuition falls short? For example, one clinician
wished for a literature-based AI that reminds them of the gender
and racial biases in old clinical trial results as well as in their own
decision-making. By sharing these clinician critiques, we hope
to start a reflective discussion about how HCI research can better
harness the specific types of information in the literature to enhance
clinician judgments in different ways.
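As one illustration of the alert-button idea, the following minimal sketch shows how a DST might aggregate the stance of retrieved literature evidence and raise an alert only when the evidence, in aggregate, disagrees with the clinician's judgment. It is not drawn from the prototype described in this paper. The EvidenceItem schema, the stance encoding (+1, 0, -1), the per-item weight, and the alert threshold of -0.25 are all hypothetical assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class EvidenceItem:
    """One literature finding relevant to a clinician's judgment (hypothetical schema)."""
    citation: str
    stance: int    # +1 supports the judgment, -1 contradicts it, 0 is neutral
    weight: float  # assumed rigor/applicability weight, greater than 0


def aggregate_stance(items: List[EvidenceItem]) -> float:
    """Return the weighted mean stance in [-1, 1]; 0.0 if there is no weighted evidence."""
    total_weight = sum(item.weight for item in items)
    if total_weight == 0:
        return 0.0
    return sum(item.stance * item.weight for item in items) / total_weight


def should_alert(items: List[EvidenceItem], threshold: float = -0.25) -> bool:
    """Alert only when the aggregate stance falls below the (assumed) threshold."""
    return aggregate_stance(items) < threshold


if __name__ == "__main__":
    # Illustrative, made-up evidence items; not real study results.
    example = [
        EvidenceItem("Trial A", stance=-1, weight=1.0),
        EvidenceItem("Cohort B", stance=-1, weight=0.5),
        EvidenceItem("Case series C", stance=1, weight=0.5),
    ]
    print(aggregate_stance(example), should_alert(example))
```

Staying silent unless the aggregate crosses a threshold is one way such a design could avoid distracting clinicians from their trained intuitions. How to weight individual studies and where to set the threshold remain open design questions.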
This paper illustrated how clinicians used literature to validate
each other’s suggestions and presented a new DST that embraces
such naturalistic interactions. The new design uses GPT-3 to draw
literature evidence that shows the AI suggestions’ robustness and
applicability (or the lack thereof). In doing so, we promote a new
approach to explainable AI that focuses not on explaining the AI
per se, but on designing information for intuitive AI trust calibration.
In the midst of explosive growth in Foundational Language Models,
this work revealed new design and research opportunities around
(1) harnessing the complementary strengths of literature-based
and predictive decision supports; (2) mitigating risks of de-skilling
clinicians; and (3) offering low-data decision support with literature.
Parallel to work developing this new design approach, there
should also be work critically examining it. Here, we also highlight
two limitations of this work that merit further research. First, while
the GPT-3-based prototype in this work was sufficient for most
clinicians we interviewed, more work is needed to understand the
errors such models can make. Today’s language technologies are
by no means perfect in processing biomedical texts. Such errors
can be particularly critical for Emergency Care, where clinicians
need concise summaries of literature evidence without losing necessary
caveats and nuances. Second, this work studied clinicians from
14 different medical specialties. Future work should evaluate this
literature-based design in more disease areas to further understand
the scope of its generalizability.
The first author’s effort is partially supported by the AI2050 Early
Career Fellowship. This work was supported by Cornell and Weill
Cornell Medicine’s Multi-Investigator Seed Grants (MISGs) “Lever-
aging Biomedical Literature in Supporting Clinical Reasoning and
Decision Making”.
[1] [n.d.]. Journal Club for iPhone/Android.
[2] Alexis Allot, Yifan Peng, Chih-Hsuan Wei, Kyubum Lee, Lon Phan, and Zhiyong
Lu. 2018. LitVar: a semantic search engine for linking genomic variant data in
PubMed and PMC. Nucleic acids research 46, W1 (2018), W530–W536.
[3] David Alvarez-Melis, Harmanpreet Kaur, Hal Daumé III, Hanna Wallach, and
Jennifer Wortman Vaughan. 2021. From human explanation to model
interpretability: A framework based on weight of evidence. In AAAI Conference on
Human Computation and Crowdsourcing (HCOMP).
[4] Daniel M Bean, Honghan Wu, Ehtesham Iqbal, Olubanke Dzahini, Zina M Ibrahim,
Matthew Broadbent, Robert Stewart, and Richard JB Dobson. 2017. Knowledge
graph prediction of unknown adverse drug reactions and validation in electronic
health records. Scientific reports 7, 1 (2017), 1–11.
[5] Mary Jo Bitner, Amy L Ostrom, and Felicia N Morgan. 2007. Service Blueprinting:
A Practical Technique for Service Innovation. (2007).
[6] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan,
Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda
Askell, et al. 2020. Language models are few-shot learners. Advances in neural
information processing systems 33 (2020), 1877–1901.
[7] Patricia B Burns, Rod J Rohrich, and Kevin C Chung. 2011. The Levels of Evidence
and their role in Evidence-Based Medicine. Plastic and reconstructive surgery 128,
1 (2011), 305.
[8] Ruth MJ Byrne. 2019. Counterfactuals in Explainable Artificial Intelligence (XAI):
Evidence from Human Reasoning. In IJCAI. 6276–6282.
[9] Carrie J Cai, Samantha Winter, David Steiner, Lauren Wilcox, and Michael Terry.
2019. "Hello AI": Uncovering the Onboarding Needs of Medical Practitioners for
Human-AI Collaborative Decision-Making. Proceedings of the ACM on
Human-computer Interaction 3, CSCW (2019), 1–24.
[10] Feixiong Cheng and Zhongming Zhao. 2014. Machine learning-based prediction
of drug–drug interactions by integrating drug phenotypic, therapeutic, chemical,
and genomic properties. Journal of the American Medical Informatics Association
21, e2 (2014), e278–e286.
[11] Deborah J Cook, Cynthia D Mulrow, and R Brian Haynes. 1997. Systematic
reviews: synthesis of best evidence for clinical decisions. Annals of internal
medicine 126, 5 (1997), 376–380.
[12] Juliet Corbin and Anselm Strauss. 2014. Basics of qualitative research: Techniques
and procedures for developing grounded theory. Sage publications.
[13] Azra Daei, Mohammad Reza Soleymani, Hasan Ashrafi-rizi, Ali
Zargham-Boroujeni, and Roya Kelishadi. 2020. Clinical information seeking behavior
of physicians: A systematic review. International Journal of Medical Informatics
139 (2020), 104144.
[14] Guilherme Del Fiol, T Elizabeth Workman, and Paul N Gorman. 2014. Clinical
questions raised by clinicians at the point of care: a systematic review. JAMA
internal medicine 174, 5 (2014), 710–718.
[15] Mariah Dreisinger, Terry L Leet, Elizabeth A Baker, Kathleen N Gillespie, Beth
Haas, and Ross C Brownson. 2008. Improving the public health workforce:
evaluation of a training course to enhance evidence-based decision making.
Journal of Public Health Management and Practice 14, 2 (2008), 138–143.
[16] Lilian Edwards and Michael Veale. 2017. Slave to the algorithm: Why a right to
an explanation is probably not the remedy you are looking for. Duke L. & Tech.
Rev. 16 (2017), 18.
[17] Upol Ehsan, Pradyumna Tambwekar, Larry Chan, Brent Harrison, and Mark O.
Riedl. 2019. Automated Rationale Generation: A Technique for Explainable AI
and Its Effects on Human Perceptions. In Proceedings of the 24th International
Conference on Intelligent User Interfaces (Marina del Ray, California) (IUI ’19).
Association for Computing Machinery, New York, NY, USA, 263–274. https:
[18] Shaker El-Sappagh, Farman Ali, Abdeltawab Hendawi, Jun-Hyeog Jang, and
Kyung-Sup Kwak. 2019. A mobile health monitoring-and-treatment system based
on integration of the SSN sensor ontology and the HL7 FHIR standard. BMC
medical informatics and decision making 19, 1 (2019), 97.
[19] John W Ely, Jerome A Osheroff, M Lee Chambliss, Mark H Ebell, and Marcy E
Rosenbaum. 2005. Answering physicians’ clinical questions: obstacles and
potential solutions. Journal of the American Medical Informatics Association 12, 2
(2005), 217–224.
[20] Matthew E Falagas, Eleni I Pitsouni, George A Malietzis, and Georgios Pappas.
2008. Comparison of PubMed, Scopus, web of science, and Google scholar:
strengths and weaknesses. The FASEB journal 22, 2 (2008), 338–342.
[21] A. R. Firestone, D. Sema, T. J. Heaven, and R. A. Weems. 1998. The effect of
a knowledge-based, image analysis and clinical decision support system on
observer performance in the diagnosis of approximal caries from radiographic
images. Caries research 32, 2 (Mar 1998), 127–34.
[22] Gary N Fox and Nashat S Moawad. 2003. UpToDate: a comprehensive clinical
database. Journal of family practice 52, 9 (2003), 706–710.
[23] Nasra Gathoni. 2021. Evidence Based Medicine: The Role of the Health Sciences
Librarian. Library Philosophy and Practice (e-journal) 6627 (2021), 1.
[24] Timnit Gebru, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan,
Hanna Wallach, Hal Daumé III, and Kate Crawford. 2021. Datasheets for datasets.
Commun. ACM 64, 12 (2021), 86–92.
[25] Marzyeh Ghassemi, Luke Oakden-Rayner, and Andrew L Beam. 2021. The false
hope of current approaches to explainable artificial intelligence in health care.
The Lancet Digital Health 3, 11 (2021), e745–e750.
[26] Randy Goebel, Ajay Chander, Katharina Holzinger, Freddy Lecue, Zeynep Akata,
Simone Stumpf, Peter Kieseberg, and Andreas Holzinger. 2018. Explainable AI:
The New 42?. In Machine Learning and Knowledge Extraction, Andreas Holzinger,
Peter Kieseberg, A Min Tjoa, and Edgar Weippl (Eds.). Springer International
Publishing, Cham, 295–303.
[27] Paul N Gorman and Mark Helfand. 1995. Information seeking in primary care:
how physicians choose which clinical questions to pursue and which to leave
unanswered. Medical Decision Making 15, 2 (1995), 113–119.
[28] Trisha Greenhalgh. 2002. Intuition and evidence–uneasy bedfellows? British
Journal of General Practice 52, 478 (2002), 395–400.
[29] Gordon Guyatt, Drummond Rennie, Maureen Meade, Deborah Cook, et al.
Users’ guides to the medical literature: a manual for evidence-based clinical practice.
Vol. 706. AMA press Chicago.
[30] Judith Haber. 2018. PART II Processes of Developing EBP and Questions in
Various Clinical Settings. Evidence-Based Practice for Nursing and Healthcare
Quality Improvement-E-Book (2018), 31.
[31] C Harris and T Turner. 2011. Evidence-Based Answers to Clinical Questions for
Busy Clinicians. In Centre for Clinical Effectiveness. Monash Health, 1–32.
[32] Andreas Holzinger, Bernd Malle, Peter Kieseberg, Peter M. Roth, Heimo Müller,
Robert Reihs, and Kurt Zatloukal. 2017. Towards the Augmented Pathologist:
Challenges of Explainable-AI in Digital Pathology. CoRR abs/1712.06657 (2017).
[33] Julie A Jacobs, Elizabeth A Dodson, Elizabeth A Baker, Anjali D Deshpande, and
Ross C Brownson. 2010. Barriers to evidence-based decision making in public
health: a national survey of chronic disease practitioners. Public Health Reports
125, 5 (2010), 736–742.
[34] Maia Jacobs, Jeffrey He, Melanie F. Pradier, Barbara Lam, Andrew C Ahn,
Thomas H McCoy, Roy H Perlis, Finale Doshi-Velez, and Krzysztof Z Gajos.
2021. Designing AI for trust and collaboration in time-constrained medical
decisions: a sociotechnical lens. In Proceedings of the 2021 CHI Conference on Human
Factors in Computing Systems. 1–14.
[35] Harmanpreet Kaur, Eytan Adar, Eric Gilbert, and Cliff Lampe. 2022. Sensible
AI: Re-Imagining Interpretability and Explainability Using Sensemaking Theory.
In 2022 ACM Conference on Fairness, Accountability, and Transparency (Seoul,
Republic of Korea) (FAccT ’22). Association for Computing Machinery, New York,
NY, USA, 702–714.
[36] Ajay Kohli and Saurabh Jha. 2018. Why CAD failed in mammography. Journal
of the American College of Radiology 15, 3 (2018), 535–537.
[37] Michael Kronenfeld, Priscilla L Stephenson, Barbara Nail-Chiwetalu, Elizabeth M
Tweed, Eric L Sauers, Tamara C Valovich McLeod, Ruiling Guo, Henry Trahan,
Kristine M Alpi, Beth Hill, et al. 2007. Review for librarians of evidence-based
practice in nursing and the allied health professions in the United States. Journal
of the Medical Library Association: JMLA 95, 4 (2007), 394.
[38] Thao Le, Tim Miller, Ronal Singh, and Liz Sonenberg. 2022. Improving Model
Understanding and Trust with Counterfactual Explanations of Model Confidence.
arXiv preprint arXiv:2206.02790 (2022).
[39] Howard Lee and Yi-Ping Phoebe Chen. 2015. Image based computer aided
diagnosis system for cancer detection. Expert Systems with Applications 42, 12
(2015), 5356–5365.
[40] Jinhyuk Lee, Wonjin Yoon, Sungdong Kim, Donghyeon Kim, Sunkyu Kim,
Chan Ho So, and Jaewoo Kang. 2020. BioBERT: a pre-trained biomedical language
representation model for biomedical text mining. Bioinformatics 36, 4 (2020),
[41] Constance D Lehman, Robert D Wellman, Diana SM Buist, Karla Kerlikowske,
Anna NA Tosteson, and Diana L Miglioretti. 2015. Diagnostic accuracy of digital
screening mammography with and without computer-aided detection. JAMA
internal medicine 175, 11 (2015), 1828–1837.
[42] Q. Vera Liao and S. Shyam Sundar. 2022. Designing for Responsible Trust
in AI Systems: A Communication Perspective. In 2022 ACM Conference on
Fairness, Accountability, and Transparency (Seoul, Republic of Korea) (FAccT
’22). Association for Computing Machinery, New York, NY, USA, 1257–1268.
[43] Q. Vera Liao, Yunfeng Zhang, Ronny Luss, Finale Doshi-Velez, and Amit
Dhurandhar. 2022. Connecting Algorithmic Research and Usage Contexts: A Perspective
of Contextualized Evaluation for Explainable AI.
[44] Youn-Kyung Lim, Erik Stolterman, and Josh Tenenberg. 2008. The anatomy of
prototypes: Prototypes as filters, prototypes as manifestations of design ideas.
ACM Transactions on Computer-Human Interaction (TOCHI) 15, 2 (2008), 1–27.
[45] Tuuli Mattelmäki et al. 2006. Design probes. Aalto University.
[46] Margaret Mitchell, Simone Wu, Andrew Zaldivar, Parker Barnes, Lucy Vasserman,
Ben Hutchinson, Elena Spitzer, Inioluwa Deborah Raji, and Timnit Gebru. 2019.
Model cards for model reporting. In Proceedings of the conference on fairness,
accountability, and transparency. 220–229.
[47] Bill Moggridge and Bill Atkinson. 2007. Designing interactions. Vol. 17. MIT press
Cambridge, MA.
[48] M Hassan Murad, Noor Asi, Mouaz Alsawas, and Fares Alahdab. 2016. New
evidence pyramid. BMJ Evidence-Based Medicine 21, 4 (2016), 125–127.
[49] Eva Nourbakhsh, Rebecca Nugent, Helen Wang, Cihan Cevik, and Kenneth
Nugent. 2012. Medical literature searches: a comparison of PubMed and
Google Scholar. Health Information & Libraries Journal 29, 3 (2012), 214–222.
[50] Benjamin E. Nye, Ani Nenkova, Iain J. Marshall, and Byron C. Wallace. [n.d.].
Trialstreamer: Mapping and Browsing Medical Evidence in Real-Time. ([n. d.]).
[51] Cecilia Panigutti, Andrea Beretta, Fosca Giannotti, and Dino Pedreschi. 2022.
Understanding the Impact of Explanations on Advice-Taking: A User Study
for AI-Based Clinical Decision Support Systems. In Proceedings of the 2022 CHI
Conference on Human Factors in Computing Systems (New Orleans, LA, USA)
(CHI ’22). Association for Computing Machinery, New York, NY, USA, Article
568, 9 pages.
[52] Kate Radcliffe, Helena C Lyson, Jill Barr-Walker, and Urmimala Sarkar. 2019.
Collective intelligence in medical decision-making: a systematic scoping review.
BMC medical informatics and decision making 19, 1 (2019), 1–11.
[53] Pranav Rajpurkar, Jeremy A. Irvin, Kaylie Zhu, Brandon Yang, Hershel Mehta,
Tony Duan, Daisy Yi Ding, Aarti Bagul, C. Langlotz, Katie S. Shpanskaya,
Matthew P. Lungren, and A. Ng. 2017. CheXNet: Radiologist-Level Pneumonia
Detection on Chest X-Rays with Deep Learning. ArXiv abs/1711.05225 (2017).
[54] Amy Rechkemmer and Ming Yin. 2022. When Confidence Meets Accuracy:
Exploring the Effects of Multiple Performance Indicators on Trust in Machine
Learning Models. In Proceedings of the 2022 CHI Conference on Human Factors in
Computing Systems (New Orleans, LA, USA) (CHI ’22). Association for Computing
Machinery, New York, NY, USA, Article 535, 14 pages.
[55] Anthony L Rosner. 2012. Evidence-based medicine: revisiting the pyramid of
priorities. Journal of Bodywork and Movement Therapies 16, 1 (2012), 42–49.
[56] Erika Rovini, Carlo Maremmani, and Filippo Cavallo. 2017. How wearable sensors
can support Parkinson’s disease diagnosis and treatment: a systematic review.
Frontiers in neuroscience 11 (2017), 555.
[57] David L Sackett, William MC Rosenberg, JA Muir Gray, R Brian Haynes, and
W Scott Richardson. 1996. Evidence based medicine: what it is and what it isn’t.
, 71–72 pages.
[58] Fernando Suarez Saiz, Corey Sanders, Rick Stevens, Robert Nielsen, Michael Britt,
Leemor Yuravlivker, Anita M Preininger, and Gretchen P Jackson. 2021. Artificial
Intelligence Clinical Evidence Engine for Automatic Identification, Prioritization,
and Extraction of Relevant Clinical Oncology Research. JCO Clinical Cancer
Informatics 5 (2021), 102–111.
[59] Mike Schaekermann, Carrie J. Cai, Abigail E. Huang, and Rory Sayres. 2020.
Expert Discussions Improve Comprehension of Difficult Cases in Medical Image
Assessment. Association for Computing Machinery, New York, NY, USA, 1–13.
[60] Andrew D Selbst and Solon Barocas. 2018. The intuitive appeal of explainable
machines. Fordham L. Rev. 87 (2018), 1085.
[61] Mark Sendak, Madeleine Clare Elish, Michael Gao, Joseph Futoma, William
Ratliff, Marshall Nichols, Armando Bedoya, Suresh Balu, and Cara O’Brien. 2020.
"The human body is a black box" supporting clinical decision-making with deep
learning. In Proceedings of the 2020 conference on fairness, accountability, and
transparency. 99–109.
[62] Salimah Z Shariff, Shayna AD Bejaimal, Jessica M Sontrop, Arthur V Iansavichus,
R Brian Haynes, Matthew A Weir, and Amit X Garg. 2013. Retrieving clinical
evidence: a comparison of PubMed and Google Scholar for quick clinical searches.
Journal of medical Internet research 15, 8 (2013), e2624.
[63] Michael Simmons, Ayush Singhal, and Zhiyong Lu. 2016. Text mining for
precision medicine: bringing structure to EHRs and biomedical literature to understand
genes and health. Translational Biomedical Informatics (2016), 139–166.
[64] Marion K Slack and Jolaine R Draugalis Jr. 2001. Establishing the internal and
external validity of experimental studies. American journal of health-system
pharmacy 58, 22 (2001), 2173–2181.
[65] Richard Smith. 1996. What clinical information do doctors need? Bmj 313, 7064
(1996), 1062–1068.
[66] Emily Sullivan. 2020. Understanding from machine learning models. The British
Journal for the Philosophy of Science (2020).
[67] Reed T Sutton, David Pincock, Daniel C Baumgart, Daniel C Sadowski, Richard N
Fedorak, and Karen I Kroeker. 2020. An overview of clinical decision support
systems: benefits, risks, and strategies for success. NPJ digital medicine 3, 1 (2020),
[68] Audrey Tan, Mark Durbin, Frank R Chung, Ada L Rubin, Allison M Cuthel,
Jordan A McQuilkin, Aram S Modrek, Catherine Jamin, Nicholas Gavin, Devin
Mann, et al. 2020. Design and implementation of a clinical decision support tool
for primary palliative Care for Emergency Medicine (PRIM-ER). BMC medical
informatics and decision making 20, 1 (2020), 1–11.
[69] Myriam Tanguay-Sela, David Benrimoh, Christina Popescu, Tamara Perez,
Colleen Rollins, Emily Snook, Eryn Lundrigan, Caitrin Armstrong, Kelly Perlman,
Robert Fratila, Joseph Mehltretter, Sonia Israel, Monique Champagne, Jérôme
Williams, Jade Simard, Sagar V. Parikh, Jordan F. Karp, Katherine Heller, Outi
Linnaranta, Liliana Gomez Cardona, Gustavo Turecki, and Howard C. Margolese.
2022. Evaluating the perceived utility of an artificial intelligence-powered
clinical decision support system for depression treatment using a simulation center.
Psychiatry Research 308 (2022), 114336.
[70] Anja Thieme, Maryann Hanratty, Maria Lyons, Jorge E Palacios, Rita Marques,
Cecily Morrison, and Gavin Doherty. 2022. Designing Human-Centered AI for
Mental Health: Developing Clinically Relevant Applications for Online CBT
Treatment. ACM Transactions on Computer-Human Interaction (2022).
[71] Simon M. Thomas, James G. Lefevre, Glenn Baxter, and Nicholas A. Hamilton.
2021. Interpretable deep learning systems for multi-class segmentation and
classification of non-melanoma skin cancer. Medical Image Analysis 68 (2021),
[72] Sahil Verma, John Dickerson, and Keegan Hines. 2020. Counterfactual
explanations for machine learning: A review. arXiv preprint arXiv:2010.10596 (2020).
[73] Danding Wang, Qian Yang, Ashraf Abdul, and Brian Y. Lim. 2019. Designing
Theory-Driven User-Centric Explainable AI. In Proceedings of the 2019 CHI
Conference on Human Factors in Computing Systems (Glasgow, Scotland Uk) (CHI
’19). ACM, New York, NY, USA, Article 601, 15 pages.
[74] Yao Xie, Melody Chen, David Kao, Ge Gao, and Xiang ‘Anthony’ Chen. 2020.
CheXplain: Enabling Physicians to Explore and Understand Data-Driven,
AI-Enabled Medical Imaging Analysis. In Proceedings of the 2020 CHI Conference on
Human Factors in Computing Systems. 1–13.
[75] Qian Yang, Aaron Steinfeld, and John Zimmerman. 2019. Unremarkable AI: Fitting
Intelligent Decision Support into Critical, Clinical Decision-Making Processes. In
Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems
(Glasgow, Scotland Uk) (CHI ’19). Association for Computing Machinery, New
York, NY, USA, 1–11.
[76] Qian Yang, John Zimmerman, Aaron Steinfeld, Lisa Carey, and James F. Antaki.
2016. Investigating the Heart Pump Implant Decision Process: Opportunities
for Decision Support Tools to Help. In Proceedings of the 2016 CHI Conference on
Human Factors in Computing Systems (San Jose, California, USA) (CHI ’16). ACM,
New York, NY, USA, 4477–4488.
... For a given input, different decision-makers may benefit from different forms of support. For example, one radiologist may provide a better diagnosis of a chest X-ray by leveraging model predictions [2] while another may perform better after viewing suggestions from senior radiologists [3] or viewing a summary of relevant medical records from a large language model (LLM) [4,5,6,7]. An individual decision-maker may also need a different form of support for different inputs. ...
... Using the population-level data per Appendix D.4, we select the (1) most likely λ and (2) the most 7 Due to a server-side glitch, 6 of the 125 recruited participants received incorrect feedback on ≤ 2% of trials. 8 We explored a 3 class variant for HSEs to directly match the computational experiments; however, we realized that participants were able to figure out that which classes were impoverished, raising the base rate of correctly categorizing such images. ...
... On the CIFAR-2A dataset, we explore the effect of varying the embedding size for KNN, with D Additional User Study DetailsD.1 Additional Recruitment DetailsWithin a task, participants are randomly assigned to an algorithm variant; an equal number of participants are included per variant (i.e., 10 for MMLU and 5 for CIFAR)7 . Participants are required to reside in the United States and speak English as a first-language. ...
Full-text available
Individual human decision-makers may benefit from different forms of support to improve decision outcomes. However, a key question is which form of support will lead to accurate decisions at a low cost. In this work, we propose learning a decision support policy that, for a given input, chooses which form of support, if any, to provide. We consider decision-makers for whom we have no prior information and formalize learning their respective policies as a multi-objective optimization problem that trades off accuracy and cost. Using techniques from stochastic contextual bandits, we propose $\texttt{THREAD}$, an online algorithm to personalize a decision support policy for each decision-maker, and devise a hyper-parameter tuning strategy to identify a cost-performance trade-off using simulated human behavior. We provide computational experiments to demonstrate the benefits of $\texttt{THREAD}$ compared to offline baselines. We then introduce $\texttt{Modiste}$, an interactive tool that provides $\texttt{THREAD}$ with an interface. We conduct human subject experiments to show how $\texttt{Modiste}$ learns policies personalized to each decision-maker and discuss the nuances of learning decision support policies online for real users.
... AI and deep learning are based on algorithms developed initially by clinicians who have a perception or model for practice which will include flaws and biases [6]. Careful calibration, validation and monitoring are required to ensure that the modelling is accurate and meaningful [7]. Clinical decision-making is multifaceted and involves the collection, analysis of data from varied sources which is evaluated against the known outcomes from research and clinical practice to enable 'actionable decisions' [8] [9] [10]. ...
Full-text available
The use of Artificial Intelligence (AI) for clinical pathway management and decision making is believed to improve clinical care and has been used to improve pathways for treatment in most medical disciplines. Methods: A literature review was undertaken to identify the hurdles and steps required to introduce supported clinical decision-making using AI within hospitals. This was supported by a survey of local hospital practice within the Mid-lands of the United Kingdom to see what systems had been introduced and were functioning effectively. Results: It is unclear how to practically implement systems using AI within medicine easily. Algorithmic medicine based on a set of rules calculated from data only takes a clinician so far to deliver patient centred optimal treatment. AI facilitates a clinician's ability to assimilate data from disparate sources and can help with some of the analysis and decision making. However, learning remains organic and the subtleties of difference between patients, care providers who exhibit non-verbal communication for instance make it difficult for an AI to capture all the pertinent information required to make the correct clinical decision for any given individual. Hence it assists rather than controls any process in clinical practice. It also must continually renew and adapt considering changes in practise and trends as the goalposts change to meet fluctuations in resources and workload. Precision surgery is benefiting from robotic-assisted surgery in parts driven by AI and being used in 80% of trusts locally. Conclusion: The use of AI in clinical practice remains patchy with it being adopted where research groups have studied a more effective method of How to cite this paper: Capes, N., monitoring or treatment. The use of robotic-assisted surgery on the other hand has been more rapid as the precision of treatment that this provides appears attractive in improving clinical care.
... As a result, designing and developing human-centered explanations has become a core theme in computer-supported cooperative work (CSCW) and human-AI collaboration research [24,48,81]. Tangentially, recent work has proposed new explanation techniques that leverage machine learning models to explain the prediction of another machine learning model [7,12,32,38,65,73,77,82,88]. A subset of these studies propose to exploit language models to generate natural language explanations for image classifications with the rationalization that natural language is more "human-friendly" [32,38,73]. ...
Full-text available
Explainability techniques are rapidly being developed to improve human-AI decision-making across various cooperative work settings. Consequently, previous research has evaluated how decision-makers collaborate with imperfect AI by investigating appropriate reliance and task performance with the aim of designing more human-centered computer-supported collaborative tools. Several human-centered explainable AI (XAI) techniques have been proposed in hopes of improving decision-makers' collaboration with AI; however, these techniques are grounded in findings from previous studies that primarily focus on the impact of incorrect AI advice. Few studies acknowledge the possibility for the explanations to be incorrect even if the AI advice is correct. Thus, it is crucial to understand how imperfect XAI affects human-AI decision-making. In this work, we contribute a robust, mixed-methods user study with 136 participants to evaluate how incorrect explanations influence humans' decision-making behavior in a bird species identification task taking into account their level of expertise and an explanation's level of assertiveness. Our findings reveal the influence of imperfect XAI and humans' level of expertise on their reliance on AI and human-AI team performance. We also discuss how explanations can deceive decision-makers during human-AI collaboration. Hence, we shed light on the impacts of imperfect XAI in the field of computer-supported cooperative work and provide guidelines for designers of human-AI collaboration systems.
... " In other cases, the human annotator completely missed important challenges that GPT-4 picked up. For instance, for [82], the AI included the challenge "mitigating risks of de-skilling clinicians in the use of AI" which the human list does not mention. ...
Large language models (LLMs), such as ChatGPT and GPT-4, are gaining widespread real-world use. Yet, the two LLMs are closed source, and little is known about the LLMs' performance in real-world use cases. In academia, LLM performance is often measured on benchmarks which may have leaked into ChatGPT's and GPT-4's training data. In this paper, we apply and evaluate ChatGPT and GPT-4 for the real-world task of cost-efficient extractive question answering over a text corpus that was published after the two LLMs completed training. More specifically, we extract research challenges for researchers in the field of HCI from the proceedings of the 2023 Conference on Human Factors in Computing Systems (CHI). We critically evaluate the LLMs on this practical task and conclude that the combination of ChatGPT and GPT-4 makes an excellent cost-efficient means for analyzing a text corpus at scale. Cost-efficiency is key for prototyping research ideas and analyzing text corpora from different perspectives, with implications for applying LLMs in academia and practice. For researchers in HCI, we contribute an interactive visualization of 4392 research challenges in over 90 research topics. We share this visualization and the dataset in the spirit of open science.
... Our view is not that the community should take a monolithic standard on what constitutes LLM explanations, but rather must articulate what different types of explanations are, along with their suitable contexts, limitations, and pitfalls. For example, justifications, when provided truthfully, can supply useful additional information for information seekers [194]. In philosophy, the social sciences, and HCI, there is a long tradition of breaking down different types of explanations by their mechanism, stance, and the questions that they answer (e.g., what, how, why, why not, what if) [62,72,91,117,126,127]. ...
The rise of powerful large language models (LLMs) brings about tremendous opportunities for innovation but also looming risks for individuals and society at large. We have reached a pivotal moment for ensuring that LLMs and LLM-infused applications are developed and deployed responsibly. However, a central pillar of responsible AI -- transparency -- is largely missing from the current discourse around LLMs. It is paramount to pursue new approaches to provide transparency for LLMs, and years of research at the intersection of AI and human-computer interaction (HCI) highlight that we must do so with a human-centered perspective: Transparency is fundamentally about supporting appropriate human understanding, and this understanding is sought by different stakeholders with different goals in different contexts. In this new era of LLMs, we must develop and design approaches to transparency by considering the needs of stakeholders in the emerging LLM ecosystem, the novel types of LLM-infused applications being built, and the new usage patterns and challenges around LLMs, all while building on lessons learned about how people process, interact with, and make use of information. We reflect on the unique challenges that arise in providing transparency for LLMs, along with lessons learned from HCI and responsible AI research that has taken a human-centered perspective on AI transparency. We then lay out four common approaches that the community has taken to achieve transparency -- model reporting, publishing evaluation results, providing explanations, and communicating uncertainty -- and call out open questions around how these approaches may or may not be applied to LLMs. We hope this provides a starting point for discussion and a useful roadmap for future research.
... Notably, we do not aim to settle on or evaluate the effectiveness of any specific design; rather, we want to understand how UI design concepts could facilitate appropriate user trust through nuanced qualitative exploration. A similar approach has been used to explore interface designs for AI-assisted decision-making systems in child welfare [36] and clinical diagnosis [65]. ...
As AI-powered code generation tools such as GitHub Copilot become popular, it is crucial to understand software developers' trust in AI tools -- a key factor for tool adoption and responsible usage. However, we know little about how developers build trust with AI, nor do we understand how to design the interface of generative AI systems to facilitate their appropriate levels of trust. In this paper, we describe findings from a two-stage qualitative investigation. We first interviewed 17 developers to contextualize their notions of trust and understand their challenges in building appropriate trust in AI code generation tools. We surfaced three main challenges -- including building appropriate expectations, configuring AI tools, and validating AI suggestions. To address these challenges, we conducted a design probe study in the second stage to explore design concepts that support developers' trust-building process by 1) communicating AI performance to help users set proper expectations, 2) allowing users to configure AI by setting and adjusting preferences, and 3) offering indicators of model mechanism to support evaluation of AI suggestions. We gathered developers' feedback on how these design concepts can help them build appropriate trust in AI-powered code generation tools, as well as potential risks in design. These findings inform our proposed design recommendations on how to design for trust in AI-powered code generation tools.
Chronic low back pain (LBP) is influenced by a broad spectrum of patient‐specific factors as codified in domains of the biopsychosocial model (BSM). Operationalizing the BSM into research and clinical care is challenging because most investigators work in silos that concentrate on only one or two BSM domains. Furthermore, the expanding, multidisciplinary nature of BSM research creates practical limitations as to how individual investigators integrate current data into their processes of generating impactful hypotheses. The rapidly advancing field of artificial intelligence (AI) is providing new tools for organizing knowledge, but the practical aspects of how AI may advance LBP research and clinical care are only beginning to be explored. The goals of the work presented here are to: (1) explore the current capabilities of knowledge integration technologies (large language models (LLMs), similarity graphs (SGs), and knowledge graphs (KGs)) to synthesize biomedical literature and depict multimodal relationships reflected in the BSM; and (2) highlight limitations, implementation details, and future areas of research to improve performance. We demonstrate preliminary evidence that LLMs, like GPT‐3, may be useful in helping scientists analyze and distinguish cLBP publications across multiple BSM domains and determine the degree to which the literature supports or contradicts emergent hypotheses. We show that SG representations and KGs enable exploring LBP's literature in novel ways, possibly providing trans‐disciplinary perspectives or insights that are currently difficult, if not infeasible, to achieve. The SG approach is automated, simple, and inexpensive to execute, and thereby may be useful for early‐phase literature and narrative explorations beyond one's areas of expertise. Likewise, we show that KGs can be constructed using automated pipelines, queried to provide semantic information, and analyzed to explore trans‐domain linkages. The examples presented support the feasibility of LBP‐tailored AI protocols to organize knowledge and support developing and refining trans‐domain hypotheses.
As new forms of data capture emerge to power new AI applications, questions abound about the ethical implications of these data collection practices. In this paper, we present clinicians' perspectives on the prospective benefits and harms of voice data collection during health consultations. Such data collection is being proposed as a means to power models to assist clinicians with medical data entry, administrative tasks, and consultation analysis. Yet, clinicians' attitudes and concerns are largely absent from the AI narratives surrounding these use cases, and the academic literature investigating them. Our qualitative interview study used the concept of an informed consent process as a type of design fiction, to support elicitation of clinicians' perspectives on voice data collection and use associated with a fictional, near-term AI assistant. Through reflexive thematic analysis of in-depth sessions with physicians, we distilled eight classes of potential risks that clinicians are concerned about, including workflow disruptions, self-censorship, and errors that could impact patient eligibility for services. We conclude with an in-depth discussion of these prospective risks, reflect on the use of the speculative processes that illuminated them, and reconsider evaluation criteria for AI-assisted clinical documentation technologies in light of our findings.
Results from Randomized Controlled Trials (RCTs) establish the comparative effectiveness of interventions, and are in turn critical inputs for evidence-based care. However, results from RCTs are presented in (often unstructured) natural language articles describing the design, execution, and outcomes of trials; clinicians must manually extract findings pertaining to interventions and outcomes of interest from such articles. This onerous manual process has motivated work on (semi-)automating extraction of structured evidence from trial reports. In this work we propose and evaluate a text-to-text model built on instruction-tuned Large Language Models (LLMs) to jointly extract Interventions, Outcomes, and Comparators (ICO elements) from clinical abstracts, and infer the associated results reported. Manual (expert) and automated evaluations indicate that framing evidence extraction as a conditional generation task and fine-tuning LLMs for this purpose realizes considerable ($\sim$20 point absolute F1 score) gains over the previous SOTA. We perform ablations and error analyses to assess aspects that contribute to model performance, and to highlight potential directions for further improvements. We apply our model to a collection of published RCTs through mid-2022, and release a searchable database of structured findings (anonymously for now):
Recent years have seen a surge of interest in the field of explainable AI (XAI), with a plethora of algorithms proposed in the literature. However, a lack of consensus on how to evaluate XAI hinders the advancement of the field. We highlight that XAI is not a monolithic set of technologies---researchers and practitioners have begun to leverage XAI algorithms to build XAI systems that serve different usage contexts, such as model debugging and decision-support. Algorithmic research of XAI, however, often does not account for these diverse downstream usage contexts, resulting in limited effectiveness or even unintended consequences for actual users, as well as difficulties for practitioners to make technical choices. We argue that one way to close the gap is to develop evaluation methods that account for different user requirements in these usage contexts. Towards this goal, we introduce a perspective of contextualized XAI evaluation by considering the relative importance of XAI evaluation criteria for prototypical usage contexts of XAI. To explore the context dependency of XAI evaluation criteria, we conduct two survey studies, one with XAI topical experts and another with crowd workers. Our results urge for responsible AI research with usage-informed evaluation practices, and provide a nuanced understanding of user requirements for XAI in different usage contexts.
The black-box nature of current artificial intelligence (AI) has caused some to question whether AI must be explainable to be used in high-stakes scenarios such as medicine. It has been argued that explainable AI will engender trust with the health-care workforce, provide transparency into the AI decision making process, and potentially mitigate various kinds of bias. In this Viewpoint, we argue that this argument represents a false hope for explainable AI and that current explainability methods are unlikely to achieve these goals for patient-level decision support. We provide an overview of current explainability techniques and highlight how various failure cases can cause problems for decision making for individual patients. In the absence of suitable explainability methods, we advocate for rigorous internal and external validation of AI models as a more direct means of achieving the goals often associated with explainability, and we caution against having explainability be a requirement for clinically deployed models.
We take inspiration from the study of human explanation to inform the design and evaluation of interpretability methods in machine learning. First, we survey the literature on human explanation in philosophy, cognitive science, and the social sciences, and propose a list of design principles for machine-generated explanations that are meaningful to humans. Using the concept of weight of evidence from information theory, we develop a method for generating explanations that adhere to these principles. We show that this method can be adapted to handle high-dimensional, multi-class settings, yielding a flexible framework for generating explanations. We demonstrate that these explanations can be estimated accurately from finite samples and are robust to small perturbations of the inputs. We also evaluate our method through a qualitative user study with machine learning practitioners, where we observe that the resulting explanations are usable despite some participants struggling with background concepts like prior class probabilities. Finally, we conclude by surfacing design implications for interpretability tools in general.
Recent advances in AI and machine learning (ML) promise significant transformations in the future delivery of healthcare. Despite a surge in research and development, few works have moved beyond demonstrations of technical feasibility and algorithmic performance. However, realizing many of the ambitious visions for how AI can contribute to clinical impact requires closer design and study of AI tools or interventions within specific health and care contexts. This paper outlines our collaborative, human-centered approach to developing an AI application that predicts treatment outcomes for patients who are receiving human-supported, internet-delivered Cognitive Behavioral Therapy (iCBT) for symptoms of depression and anxiety. Intersecting the fields of HCI, AI and healthcare, we describe how we addressed the specific challenges of: (1) identifying clinically relevant AI applications; and (2) designing AI applications for sensitive use contexts like mental health. Aiming to better assist the work practices of iCBT supporters, we share how learnings from an interview study with 15 iCBT supporters surfaced their practices and information needs, and revealed new opportunities for the use of AI. Combined with insights from the clinical literature and technical feasibility constraints, this led to the development of two clinical outcome prediction models. To clarify their potential utility for use in practice, we conducted 13 design sessions with iCBT supporters that utilized interface mock-ups to concretize the AI output and derive additional design requirements. Our findings demonstrate how design choices can impact interpretations of the AI predictions as well as supporter motivation and sense of agency. We detail how this analysis and the design principles derived from it enabled the integration of the prediction models into a production interface. Reporting on identified risks of over-reliance on AI outputs and needs for balanced information assessment and preservation of a focus on individualized care, we discuss and reflect on what constitutes a responsible, human-centered approach to AI design in this healthcare context.
Introduction: Evidence Based Medicine challenges the traditional role of the librarian, who must acquire the skills needed to support it. Objectives: To outline the steps in the Evidence Based Medicine process; to explore the role of the health sciences librarian in the Evidence Based Medicine process; and to explore the challenges and opportunities that health sciences librarians encounter. Methodology: The study was descriptive, and a qualitative summary of the data was provided. The study obtained data from the views and experiences of 20 medical librarians from various countries during an online distance education course. Conclusion and implication: Despite the various challenges encountered, the role of the health sciences librarian is acknowledged as critical in supporting Evidence Based Medicine.
Aifred is a clinical decision support system (CDSS) that uses artificial intelligence to assist physicians in selecting treatments for major depressive disorder (MDD) by providing probabilities of remission for different treatment options based on patient characteristics. We evaluated the utility of the CDSS as perceived by physicians participating in simulated clinical interactions. Twenty psychiatry and family medicine staff and residents completed a study in which each physician had three 10-minute clinical interactions with standardized patients portraying mild, moderate, and severe episodes of MDD. During these scenarios, physicians were given access to the CDSS, which they could use in their treatment decisions. The perceived utility of the CDSS was assessed through self-report questionnaires, scenario observations, and interviews. 60% of physicians perceived the CDSS to be a useful tool in their treatment-selection process, with family physicians perceiving the greatest utility. Moreover, 50% of physicians would use the tool for all patients with depression, with an additional 35% noting they would reserve the tool for more severe or treatment-resistant patients. Furthermore, clinicians found the tool to be useful in discussing treatment options with patients. The efficacy of this CDSS and its potential to improve treatment outcomes must be further evaluated in clinical trials.