Comment
https://doi.org/10.1038/s41467-022-32186-3
Addressing fairness in artificial intelligence for medical imaging
María Agustina Ricci Lara, Rodrigo Echeveste and Enzo Ferrante
A plethora of work has shown that AI systems can be systematically and unfairly biased against certain populations in multiple scenarios. The field of medical imaging, where AI systems are being increasingly adopted, is no exception. Here we discuss the meaning of fairness in this area and comment on the potential sources of biases, as well as the strategies available to mitigate them. Finally, we analyze the current state of the field, identifying strengths and highlighting areas of vacancy, challenges and opportunities that lie ahead.
With the exponential growth in the development of artificial intelligence (AI) systems for the analysis of medical images, hospitals and medical centers have started to deploy such tools in clinical practice1. These systems are typically powered by a particular type of machine learning (ML) technique known as deep learning (DL). DL methods learn complex data representations by employing multiple layers of processing with different levels of abstraction, which are useful to solve a wide spectrum of tasks. In the context of medical image computing (MIC), examples of such tasks include pathology classification, anatomical segmentation, lesion delineation, image reconstruction, synthesis, registration and super-resolution, among many others2. While the number of scientific publications related to DL methods applied to different MIC problems in laboratory conditions has grown exponentially, clinical trials aimed at evaluating medical AI systems have only recently started to gain momentum. In fact, according to the American College of Radiology, to date fewer than 200 AI medical products related to radiology and other imaging domains have been cleared by the United States Food and Drug Administration3.
Recently, the research community studying fairness in ML has highlighted that ML systems can be biased against certain sub-populations, in the sense that they present disparate performance for different sub-groups defined by protected attributes such as age, race/ethnicity, sex or gender, or socioeconomic status, among others4,5.
In the field of healthcare, the potential unequal behavior of algorithms towards different population sub-groups could even be considered to go against the principles of bioethics: justice, autonomy, beneficence and non-maleficence6. In this context, fostering fairness in MIC becomes essential. However, this is far from being a simple task: ensuring equity in ML deployments requires tackling multiple different aspects along the whole design, development and implementation pathway. While the implications of fairness in ML for the broad field of healthcare have recently been surveyed and discussed7, in this comment we focus on the sub-field of medical imaging. Indeed, when it comes to biases in ML systems that can benefit certain sub-populations to the detriment of others, the field of medical imaging is no exception8,9. In what follows we will comment on recent work in the field and highlight valuable unexplored areas of research, discussing potential challenges and available strategies.
What does it mean for an algorithm to be fair?
Let us start by considering this question in the context of patient sub-groups defined by skin tone or race/ethnicity, where a number of recent articles have compared the performance of MIC systems for suspected ophthalmologic, thoracic and/or cardiac pathologies. For example, when it comes to diagnosing diabetic retinopathy, a severe imbalance in the data used to train a model may result in a strong gap in diagnostic accuracy (73% vs. 60.5%) between light-skinned and dark-skinned subjects10. In the same vein, it has been found that models classifying pathologies from chest radiographs have a higher rate of underdiagnosis for under-served sub-populations, including Black patients9, so that the use of these tools could increase the probability of those patients being sent home without receiving the care they need. Lower performance of AI models designed for cardiac MRI segmentation (in terms of the Dice coefficient) has also been found in this group11, which may result in compounded biases if any further diagnostic analysis were to be performed on the automatically delineated silhouette.
After reading these examples, we immediately and automatically recognize these situations as unfair. However, establishing a criterion to determine whether an algorithm can be called fair is actually a thorny issue. In the previous paragraph we have purposely mentioned examples where different metrics were employed in each case. Indeed, the first issue one encounters is that a large number of candidate measures exist. One can, for instance, evaluate fairness by comparing standard ML performance metrics across different sub-groups, such as accuracy10,12–16 or the AUC-ROC (the area under the receiver operating characteristic curve)8–10,14–22, among others. Alternatively, one can choose to employ one of the (no fewer than ten) different fairness-specific criteria formulated by the community23 in order to audit the presence of bias in a given model16,18. To complicate matters further, even if one carries out a multi-dimensional study by simultaneously employing multiple metrics9,10,14–16,20,21,24, which model to select at the end in a given setting may be no trivial matter, and additional information will in general be required. Along these lines, on those occasions when the prevalence of the target condition differs between sub-groups (Fig. 1, top row), special care must be taken in the selection of the fairness definition to be used25. For example, the demographic parity criterion (Fig. 1, bottom row, right side), which requires equal chances of positive predictions in each group, would here suggest that the algorithm is unfair for presenting a higher probability of a positive result for the sub-group with a greater prevalence of the target condition. This criterion assumes that the prediction of an algorithm is independent of the protected attribute that defines each sub-group, so it may be suitable in settings such as loan eligibility prediction or hiring for job vacancies, but not for disease prediction cases where the prevalence depends on the aforementioned attribute. In these cases, it would be more appropriate to resort to definitions such as the equal opportunity criterion (Fig. 1, bottom row, right side), which compares true positive rates between sub-groups, a quantity whose computation is independent of the pre-test probability. Overall, it becomes clear that a one-size-fits-all definition of fairness in MIC will not exist.
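To make the difference between these two criteria concrete, the short sketch below reproduces the toy scenario of Fig. 1 and computes, per sub-group, the positive prediction rate (the quantity demographic parity compares) and the true positive rate (the quantity equal opportunity compares). The cohort sizes, prevalences and the perfectly accurate classifier are assumptions chosen only to mirror that figure, not data from any of the cited studies.

```python
import numpy as np

# Toy cohort mirroring Fig. 1: 20 subjects per group, disease prevalence
# of 40% in the "blue" group and 20% in the "red" group (hypothetical values).
groups = np.array(["blue"] * 20 + ["red"] * 20)
y_true = np.array([1] * 8 + [0] * 12 + [1] * 4 + [0] * 16)
y_pred = y_true.copy()  # a perfectly accurate classifier, as in the figure

for g in ("blue", "red"):
    mask = groups == g
    # Demographic parity compares P(prediction = 1 | group).
    positive_rate = y_pred[mask].mean()
    # Equal opportunity compares P(prediction = 1 | disease present, group),
    # i.e. the true positive rate, which is unaffected by prevalence differences.
    tpr = y_pred[mask & (y_true == 1)].mean()
    print(f"{g}: positive prediction rate = {positive_rate:.0%}, TPR = {tpr:.0%}")

# Output: blue has a 40% positive prediction rate vs. 20% for red (demographic
# parity violated), while both groups reach a 100% TPR (equal opportunity satisfied).
```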
Three reasons behind biased systems: data, models and people
Providing effective solutions to disparities in the outcomes of AI systems starts by identifying their possible underlying causes (Fig. 2). The lack of diversity and proper representation of the target population in the training databases has been identified as one of the main reasons behind this phenomenon4 (Fig. 2). In the context of MIC, ML systems are trained using large databases of images, usually accompanied by annotations or labels indicating the desired output that we expect from the system (e.g., X-ray images with labels associated with the radiological finding of interest, like pneumonia or cardiomegaly). When the demographics of such databases do not match those of the target population, the trained model may be biased, presenting lower performance in the underrepresented groups11. Indeed, in chest X-ray pathology classification, only a few of the major available datasets in that domain include information about race/ethnicity and, in cases where this information is included, databases tend to be skewed in terms of those attributes26.
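A first, purely descriptive audit of this issue can be carried out directly on a dataset's metadata by tabulating group shares and per-group label distributions and comparing them with the intended target population. The sketch below is a minimal illustration using pandas; the column names, the example rows and the population shares are all hypothetical.

```python
import pandas as pd

# Hypothetical metadata accompanying an X-ray dataset; in practice this would
# be loaded from the dataset's own CSV (column names are assumptions).
meta = pd.DataFrame({
    "race_ethnicity": ["White", "White", "Black", "White", "Asian", "Black"],
    "finding": ["pneumonia", "no finding", "pneumonia", "cardiomegaly",
                "no finding", "no finding"],
})

# Share of each group in the training data ...
train_shares = meta["race_ethnicity"].value_counts(normalize=True)
# ... versus the (hypothetical) shares in the population the model will serve.
target_shares = pd.Series({"White": 0.60, "Black": 0.13, "Asian": 0.06})

print(pd.concat({"train": train_shares, "target": target_shares}, axis=1))
# Label distribution per group, to spot groups with few positive examples.
print(pd.crosstab(meta["race_ethnicity"], meta["finding"], normalize="index"))
```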
One point to keep in mind is that an ML system violating one particular definition of fairness should not necessarily be considered biased. In this sense, the selection of appropriate metrics to assess and ensure fairness according to the specific use case is a delicate task that requires careful human intervention. Moreover, such a choice will also be conditioned by the fact that some of these metrics are mutually exclusive27, implying that, for example, building a classifier that is simultaneously maximally fair in terms of outcomes, opportunities and calibration will not be feasible most of the time. In addition, other choices related to model design, such as the architecture, loss function, optimizer or even hyper-parameters, may also play a fundamental role in bias amplification or mitigation28 (Fig. 2). The same happens with sampling criteria for database construction. For the above reasons, if decisions are made exclusively by developers, engineers, medical specialists or data scientists working in isolation, or by groups of people who share the same ethnic or social background, there is a risk that their own biases may be unintentionally incorporated into the system based on what they choose to prioritize (Fig. 2).
Taking a step back, complex structural reasons for bias also need to be taken into account. We highlight some of these here (see ref. 7 for an in-depth analysis). Unequal treatment of patients, as well as disparate access to the healthcare system due to economic inequalities, conspires against investigating certain pathologies in under-represented populations. Anatomical differences and even variability in the manifestation of diseases across sub-groups can moreover act as confounders. Likewise, many health problems of particular relevance to low-income countries are often understudied due to a lack of research funding in those countries. Finally, while auditing systems for potential biases, people may unintentionally only search within the possibilities and the reality with which they are familiar.
Fig. 1 | Group-fairness metrics. Here we include a toy example in the context of disease classification, where two sub-populations characterized by different protected attributes (in red and blue) present different disease prevalence (40% and 20% for blue and red subjects respectively, top row; x marks positive cases). A model optimized for discriminative performance was assessed on a test set, achieving 100% accuracy (bottom row, left side; + marks positive predictions). Algorithm fairness was audited according to two common metric choices (bottom row, right side). In this case, as a consequence of the difference in disease frequency, the model would not fulfill the demographic parity criterion (bottom row, right side), since the positive prediction rates vary between sub-groups: 40% (8 positive predictions over 20 cases) for the blue sub-group vs. 20% (4 positive predictions over 20 cases) for the red sub-group. On the other hand, the model would fulfill the equal opportunity criterion, as true positive rates match for both sub-groups, reaching the value of 100%: 8 true positives out of 8 positive ground truth cases for the blue sub-group and 4 true positives out of 4 positive ground truth cases for the red sub-group. FN false negatives, FP false positives, TN true negatives, TP true positives. See the legend box with symbols in the top right corner.
Fig. 2 | Main potential sources of bias in AI systems for MIC. The data being fed to the system during training (1), design choices for the model (2), and the people who develop those systems (3) may all contribute to biases in AI systems for MIC.
Bias mitigation strategies
Several studies in recent years have proposed solutions to mitigate bias and develop fairer algorithms10,11,14–17,19,20,24. There are three main stages at which bias mitigation strategies can be adopted11: before, during and after training. Before training, one would ideally seek to rebalance datasets by collecting more representative data (Fig. 2). However, in the medical context this is far from trivial, as this process requires patients to give consent for their data to be used for research purposes, as well as the involvement of specialists to analyze each case and provide ground-truth labels. Moreover, the low prevalence of certain conditions might hinder finding sufficient examples. In this sense, a compromise solution involves removing features linked to sensitive information or using data resampling strategies.
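As one possible illustration of such a resampling strategy, the sketch below uses PyTorch's WeightedRandomSampler to draw minority-group examples more often, so that training batches are roughly balanced across a protected attribute. The tensors, group sizes and weighting scheme are illustrative assumptions rather than a recipe from the cited works.

```python
import torch
from torch.utils.data import WeightedRandomSampler, DataLoader, TensorDataset

# Hypothetical training set: images, labels and a protected-attribute code
# per example (0/1), with group 1 heavily under-represented.
images = torch.randn(1000, 1, 64, 64)
labels = torch.randint(0, 2, (1000,))
group = torch.cat([torch.zeros(900, dtype=torch.long),
                   torch.ones(100, dtype=torch.long)])

# Weight each example inversely to the size of its group, so that minority-group
# examples are drawn more often and batches are roughly group-balanced.
group_counts = torch.bincount(group).float()
weights = 1.0 / group_counts[group]
sampler = WeightedRandomSampler(weights, num_samples=len(weights), replacement=True)

loader = DataLoader(TensorDataset(images, labels, group),
                    batch_size=32, sampler=sampler)
```

Reweighting by the joint frequency of group and target label, rather than by group alone, is a common variant when the target class is also imbalanced.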
During training, several alternatives exist to mitigate model biases (Fig. 2), such as the use of data augmentation10,14,19 and adversarial training17,20,24, with the combination of both having even been employed15. The use of generative methods as a way to augment the dataset, for instance, has proven effective in reducing the disparity in the diagnostic accuracy of diabetic retinopathy between light-skinned and dark-skinned individuals10. On the other hand, adversarial schemes have been shown to reduce biases in skin lesion classification24. In this case, adversarial methods intend to increase the performance of a primary model on the target variable while minimizing the ability of a second (adversarial) model to predict the protected attribute from the features learned by the primary model23. Finally, after training, model outcomes can be post-processed so as to calibrate the predictions across the different sub-groups. These methods focus on the second reason behind biased systems mentioned above, namely models.
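The adversarial scheme described above can be sketched, in a deliberately simplified form, as an alternating optimization in which an auxiliary head tries to recover the protected attribute from the learned features, while the encoder is penalized when it succeeds. The architecture, loss weighting and training step below are assumptions for illustration and do not reproduce the specific methods of the cited papers.

```python
import torch
import torch.nn as nn

# Minimal sketch of adversarial debiasing (all shapes and hyper-parameters
# are illustrative assumptions).
encoder = nn.Sequential(nn.Flatten(), nn.Linear(64 * 64, 128), nn.ReLU())
task_head = nn.Linear(128, 2)   # predicts the target pathology
adversary = nn.Linear(128, 2)   # tries to predict the protected attribute

opt_main = torch.optim.Adam(
    list(encoder.parameters()) + list(task_head.parameters()), lr=1e-3)
opt_adv = torch.optim.Adam(adversary.parameters(), lr=1e-3)
ce = nn.CrossEntropyLoss()
lam = 0.5  # strength of the fairness penalty

def training_step(x, y, a):
    # 1) Update the adversary to predict the protected attribute from features.
    with torch.no_grad():
        z = encoder(x)
    adv_loss = ce(adversary(z), a)
    opt_adv.zero_grad(); adv_loss.backward(); opt_adv.step()

    # 2) Update encoder + task head: do well on the task while making the
    #    protected attribute hard to recover from the learned features.
    z = encoder(x)
    main_loss = ce(task_head(z), y) - lam * ce(adversary(z), a)
    opt_main.zero_grad(); main_loss.backward(); opt_main.step()
    return main_loss.item()

# Example call on a random mini-batch (x: images, y: labels, a: protected attribute).
x = torch.randn(16, 1, 64, 64)
y = torch.randint(0, 2, (16,))
a = torch.randint(0, 2, (16,))
loss = training_step(x, y, a)
```

In practice the penalty weight (lam here) controls the trade-off discussed in the next paragraph: larger values suppress more attribute information from the representation, often at some cost in task performance.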
It must be noted, however, that methods designed to improve algorithmic fairness may lead in practice to different outcomes. In the best-case scenario, applying bias mitigation strategies increases the performance of the algorithm for all sub-groups14, posing no additional constraints. At the other end of the spectrum, a reduction in the performance for all sub-groups may result from trying to achieve algorithmic fairness17. Indeed, interventions to achieve group fairness may create tensions with the primary goal of the algorithms, requiring a compromise solution. This outcome poses a dilemma in healthcare settings, since it could be interpreted as violating the principles of bioethics, specifically that of non-maleficence. These two extremes are however rare, and a frequent outcome observed in the existing MIC fairness studies analyzed in this article is performance improvement for the disadvantaged group at the expense of a reduction for another group or groups11. This trade-off is also not free of controversies, and once again we find ourselves in a situation where the decision of what is acceptable in a given setting requires careful human consideration. That is why, as discussed in the previous section, diversity is key not only in terms of databases, but also in terms of team composition (Fig. 2). Hence, adopting participatory design practices that explicitly incorporate perspectives from a diverse set of stakeholders29 is a fundamental aspect of dealing with algorithmic bias.
Challenges and outlook for fairness studies in MIC
Even though the field has been steadily growing over the past few years, there are still challenges and open research questions that we believe need to be addressed.
Areas of vacancy. While this growing trend is highly encouraging, the efforts have been far from even across the landscape of medical specialties and problems being tackled, leaving several areas of vacancy. Firstly, so far algorithmic justice analyses have mostly been carried out in four medical imaging specialties: radiology8,9,16,18–22, dermatology12,13,17,19,24, ophthalmology10,14,15 and cardiology11. We believe that this uneven coverage is partly due to the limited availability of MI databases with demographic information on the population (Table 1), something which has been highlighted in several previous studies8,17. The absence of this information may be related to the trade-off between data utility and privacy when releasing public databases, in the sense that including sensitive attributes useful for bias audits may go against the privacy of the individuals.
Table 1 | Databases commonly used in fairness in MIC studies

Image modality | Database | Access | Sex or gender(a) | Age | Skin tone or race/ethnicity(b) | SES
Chest X-ray | CheXpert31 | Public | x | x | x | –
Chest X-ray | NIH Chest X-Ray32 | Public | x | x | – | –
Chest X-ray | MIMIC Chest X-Ray33 | Public | x | x | x | x
Chest X-ray | Emory University Hospital Chest X-Ray20 | Private | x | x | x | –
Mammography | Digital Mammographic Imaging Screening Trial (DMIST)34 | Private | x | x | x | –
Mammography | Emory University Hospital Mammography20 | Private | x | x | x | –
Dermoscopy | ISIC Challenge 2017/18/2035,36 | Public | x | x | – | –
Dermatological clinical image | Fitzpatrick 17k13 | Public | – | – | x | –
Dermatological clinical image | SD-19849 | Public | – | – | – | –
Fundus image | AREDS37 | Public | x | x | x | –
Fundus image | Kaggle EyePACS50 | Public | – | – | – | –
Cardiac MRI | UK Biobank38 | Public | x | x | x | x
Pulmonary angiography CT | Stanford University Medical Center16 | Public | x | x | x | –

(a) According to the World Health Organization, sex refers to the different biological and physiological characteristics of males and females, while gender refers to the socially constructed characteristics of women and men, such as norms, roles and relationships of and between groups of women and men. Databases tend to report one or the other.
(b) We include both the terms race and ethnicity since the cited studies make use of both denominations. We group analyses across different skin tones in this category as well. Race and ethnicity are social constructs with complex and dynamic definitions (see ref. 47).
To overcome these limitations, the implementation of technical solutions that simultaneously address the demands for data protection and data utilization becomes extremely important30. Moreover, it must be noted that the subset of sensitive attributes either directly reported or estimated varies from dataset to dataset. The currently most widely reported characteristics are age and sex or gender16,20,31–38, followed by skin tone or race/ethnicity13,16,20,33,34,37,38, and to a lesser extent socioeconomic characteristics33,38. In some cases, where protected attributes are not available, estimates can be computed using image processing methods12,13,15,19,24, and occasionally manual labeling by professionals can be used10,13. These strategies, however, bring with them an additional level of complexity and subtlety in their implementation, which can limit reproducibility and the comparison of results across sub-groups.
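As one example of such an image-based estimate, part of the dermatology fairness literature approximates skin tone via the individual typology angle (ITA), computed over non-lesion skin pixels in the CIELAB color space. The sketch below shows only the core computation; the synthetic input image, the trivial skin mask and the use of the median are simplifying assumptions, and a real pipeline would typically first segment out the lesion and may correct for illumination.

```python
import numpy as np
from skimage import color

def estimate_ita(rgb_image, skin_mask):
    """Individual typology angle (degrees) over pixels flagged as healthy skin.

    rgb_image: float array in [0, 1], shape (H, W, 3); skin_mask: boolean (H, W).
    The mask would normally come from a lesion/skin segmentation step.
    """
    lab = color.rgb2lab(rgb_image)
    L, b = lab[..., 0][skin_mask], lab[..., 2][skin_mask]
    ita = np.degrees(np.arctan2(L - 50.0, b))
    return float(np.median(ita))

# Toy usage on a synthetic patch (a real pipeline would segment skin first).
img = np.clip(np.random.rand(64, 64, 3) * 0.3 + 0.6, 0, 1)
mask = np.ones((64, 64), dtype=bool)
print(f"estimated ITA: {estimate_ita(img, mask):.1f} degrees")
# Larger ITA values correspond to lighter skin; ITA ranges are then binned
# into Fitzpatrick-like categories (cut-offs vary across studies).
```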
Secondly, important vacancies exist regarding the MIC task being tackled. The vast majority of studies conducted to date deal with pathology classification tasks8–10,12–22,24. The study of fairness in the context of segmentation is, however, rare11, and studies of regression, registration, synthesis and super-resolution are rarer still, leaving entire areas to be explored.
Incorporating fairness audits as common practice in MIC studies. As highlighted by a recent article17 which analyzed the common practices when reporting results for diagnostic algorithms at one of the major conferences on MIC, demographics are rarely mentioned, and disaggregated results are infrequently discussed by scientific publications in this domain. This matter is also addressed by the FUTURE-AI guidelines39, which include principles and consensus recommendations for trustworthy AI in medical imaging, and not only focus on fairness but also cover other fundamental dimensions like universality, traceability, usability, robustness and explainability. In that sense, we believe the FUTURE-AI guidelines may constitute a practical tool to improve the publication practices of our community.
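As a minimal illustration of what such disaggregated reporting can look like, the snippet below computes the AUC-ROC overall and per sub-group from a table of model scores; the data frame, its column names and the random scores are placeholders rather than results from any study.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score

# Placeholder predictions on a test set with a recorded protected attribute.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "y_true": rng.integers(0, 2, 500),
    "score": rng.random(500),
    "sex": rng.choice(["F", "M"], 500),
})

# Report the overall metric together with one row per sub-group, rather than
# the aggregate alone, so that performance gaps become visible.
rows = [("overall", roc_auc_score(df["y_true"], df["score"]))]
for g, sub in df.groupby("sex"):
    rows.append((g, roc_auc_score(sub["y_true"], sub["score"])))
print(pd.DataFrame(rows, columns=["group", "AUC-ROC"]))
```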
Increasing diversity in database construction. As researchers working in Latin America, we want to stress the importance of widening geographic representation in the building of publicly available MI datasets. It has been acknowledged by several studies that the vast majority of MI databases employed for AI developments originate from high-income countries, mostly in Europe and North America40–42. This introduces a clear selection bias, since the demographics of these countries do not match those of other regions like Africa, Asia or Latin America. This fact, combined with experimental studies suggesting that race/ethnicity imbalance in MI databases may be one of the reasons behind unequal performance11, calls for action towards building truly international databases which include patients from low-income countries. This issue becomes even more relevant in light of recent findings which confirm that AI can trivially predict protected attributes from medical images, even in settings where clinical experts cannot, such as race/ethnicity in chest X-rays26 and ancestry in histologic images43. While this fact by itself does not immediately mean that systems will be biased, in combination with a greedy optimization scheme in a setting with strong data imbalance, it may provide a direct vector for the reproduction of pre-existing racial disparities.
In this regard, initiatives such as the All of Us Research Program, which invites participants from different sub-groups in the United States to create a more diverse health database, hope to promote and improve biomedical research as well as medical care44. Efforts such as this one, currently focused on an individual country, could be replicated and lay the groundwork for a collaborative enterprise that transcends geographic barriers.
Rethinking fairness in the context of medical image analysis. For some time now, research on fairness in ML has been carried out in decision-making scenarios such as loan applications, hiring systems and criminal behavior reexamination, among others23. However, the field of healthcare in general, and medical imaging in particular, exhibit unique characteristics that require adapting the notion of fairness to this context. Take chest X-ray images, for example: particular diagnostic tasks could be easier in one sub-population than in another due to anatomical differences45. How to ensure fairness across sub-populations in this case is far from obvious.
Another example is that of existing bias mitigation strategies which may end up reducing model performance for the majority, or even for all sub-populations, in exchange for reducing the variance across them. This might be admissible in other contexts, but in the case of healthcare it implies purposely deteriorating the quality of the predictions for a given sub-group, causing ethical and legal problems related to the provision of alternative standards of care for different sub-groups21. Moreover, how to define such sub-groups is already an open question: the group-fairness framework, usually applied to problems like loan granting or intended to deal with legal notions of anti-discrimination, reinforces the idea that groups based on pre-specified demographic attributes are well-defined constructs that correspond to a set of homogeneous populations29. However, certain attributes, like gender identity46, are fluid constructs that are difficult to categorize and require rethinking this framework. Similar issues may arise when using race or ethnicity47 as protected attributes to define groups of analysis and evaluate fairness metrics.
While some factors influencing fairness and model performance metrics, such as target class imbalance, are common to several ML domains, others, such as differences in disease prevalence across sub-populations, have to be carefully taken into consideration when it comes to MIC. The same holds for the cognitive biases that may be introduced by medical specialists when interpreting and annotating imaging studies48. While AI has been postulated as a potential tool to help reduce such biases, if not properly addressed it could also become a means to amplify and perpetuate them.
Overall, there is no denying that the nascent field of fairness in ML studies for MIC still presents important vacancies, both in terms of medical specialties and in terms of the types of problems being tackled, which will require increased efforts from the community. However, the rapid growth of the field, the development of new guidelines, and the growing attention reported here are highly positive and encourage the MIC community to increase its efforts to contribute towards delivering a more equitable standard of care.
María Agustina Ricci Lara1,2, Rodrigo Echeveste3,4 & Enzo Ferrante3,4

1 Health Informatics Department, Hospital Italiano de Buenos Aires, Ciudad Autónoma de Buenos Aires, Argentina. 2 Universidad Tecnológica Nacional, Ciudad Autónoma de Buenos Aires, Argentina. 3 Research Institute for Signals, Systems and Computational Intelligence sinc(i) (FICH-UNL/CONICET), Santa Fe, Argentina. 4 These authors contributed equally: Rodrigo Echeveste, Enzo Ferrante.

e-mail: maria.ricci@hospitalitaliano.org.ar; recheveste@sinc.unl.edu.ar; eferrante@sinc.unl.edu.ar
Received: 8 March 2022; Accepted: 21 July 2022;
References
1. Esteva, A. et al. Deep learning-enabled medical computer vision. NPJ Digit. Med. 4, 1–9 (2021).
2. Litjens, G. et al. A survey on deep learning in medical image analysis. Med. Image Anal. 42, 60–88 (2017).
3. Lin, M. What's needed to bridge the gap between US FDA clearance and real-world use of AI algorithms. Acad. Radiol. 29, 567–568 (2022).
4. Buolamwini, J. & Gebru, T. Gender shades: intersectional accuracy disparities in commercial gender classification. In Conference on Fairness, Accountability and Transparency, 77–91 (PMLR, 2018).
5. Zou, J. & Schiebinger, L. AI can be sexist and racist - it's time to make it fair. Nature 559, 324–326 (2018).
6. Beauchamp, T. L. & Childress, J. F. Principles of Biomedical Ethics (Oxford University Press, 1979).
7. Chen, I. Y. et al. Ethical machine learning in healthcare. Ann. Rev. Biomed. Data Sci. 4, 123–144 (2021).
8. Larrazabal, A. J., Nieto, N., Peterson, V., Milone, D. H. & Ferrante, E. Gender imbalance in medical imaging datasets produces biased classifiers for computer-aided diagnosis. Proc. Natl Acad. Sci. 117, 12592–12594 (2020).
9. Seyyed-Kalantari, L., Zhang, H., McDermott, M., Chen, I. Y. & Ghassemi, M. Underdiagnosis bias of artificial intelligence algorithms applied to chest radiographs in under-served patient populations. Nat. Med. 27, 2176–2182 (2021).
10. Burlina, P., Joshi, N., Paul, W., Pacheco, K. D. & Bressler, N. M. Addressing artificial intelligence bias in retinal diagnostics. Transl. Vis. Sci. Technol. 10, 13–13 (2021).
11. Puyol-Antón, E. et al. Fairness in cardiac MR image analysis: an investigation of bias due to data imbalance in deep learning based segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, 413–423 (Springer, 2021).
12. Kinyanjui, N. M. et al. Fairness of classifiers across skin tones in dermatology. In International Conference on Medical Image Computing and Computer-Assisted Intervention, 320–329 (Springer, 2020).
13. Groh, M. et al. Evaluating deep neural networks trained on clinical images in dermatology with the Fitzpatrick 17k dataset. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 1820–1828 (2021).
14. Joshi, N. & Burlina, P. AI fairness via domain adaptation. Preprint at arXiv https://doi.org/10.48550/arXiv.2104.01109 (2021).
15. Paul, W., Hadzic, A., Joshi, N., Alajaji, F. & Burlina, P. TARA: training and representation alteration for AI fairness and domain generalization. Neural Comput. 34, 716–753 (2022).
16. Zhou, Y. et al. RadFusion: benchmarking performance and fairness for multimodal pulmonary embolism detection from CT and EHR. Preprint at arXiv https://doi.org/10.48550/arXiv.2111.11665 (2021).
17. Abbasi-Sureshjani, S., Raumanns, R., Michels, B. E., Schouten, G. & Cheplygina, V. Risk of training diagnostic algorithms on data with demographic bias. In Interpretable and Annotation-Efficient Learning for Medical Image Computing, 183–192 (Springer, 2020).
18. Seyyed-Kalantari, L., Liu, G., McDermott, M., Chen, I. Y. & Ghassemi, M. CheXclusion: fairness gaps in deep chest X-ray classifiers. In BIOCOMPUTING 2021: Proceedings of the Pacific Symposium, 232–243 (World Scientific, 2020).
19. Cheng, V., Suriyakumar, V. M., Dullerud, N., Joshi, S. & Ghassemi, M. Can you fake it until you make it? Impacts of differentially private synthetic data on downstream classification fairness. In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, 149–160 (Association for Computing Machinery (ACM), 2021).
20. Correa, R. et al. Two-step adversarial debiasing with partial learning - medical image case-studies. In AAAI 2022 Workshop: Trustworthy AI for Healthcare. Preprint at arXiv https://doi.org/10.48550/arXiv.2111.08711 (2021).
21. Glocker, B. & Winzeck, S. Algorithmic encoding of protected characteristics and its implications on disparities across subgroups. Preprint at arXiv https://doi.org/10.48550/arXiv.2110.14755 (2021).
22. Suriyakumar, V. M., Papernot, N., Goldenberg, A. & Ghassemi, M. Chasing your long tails: differentially private prediction in health care settings. In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, 723–734 (Association for Computing Machinery (ACM), 2021).
23. Mehrabi, N., Morstatter, F., Saxena, N., Lerman, K. & Galstyan, A. A survey on bias and fairness in machine learning. ACM Comput. Surveys 54, 1–35 (2021).
24. Li, X., Cui, Z., Wu, Y., Gu, L. & Harada, T. Estimating and improving fairness with adversarial learning. Preprint at arXiv https://doi.org/10.48550/arXiv.2103.04243 (2021).
25. King, A. What do we want from fair AI in medical imaging? MMAG Blog Post. Available online at: http://kclmmag.org/blog/what-do-wewant-from-fair-ai-in-medical-imaging/ (2022).
26. Gichoya, J. W. et al. AI recognition of patient race in medical imaging: a modelling study. Lancet Digit. Health 4, E406–E414 (2022).
27. Kleinberg, J., Mullainathan, S. & Raghavan, M. Inherent trade-offs in the fair determination of risk scores. In Proceedings of Innovations in Theoretical Computer Science (ITCS). Preprint at arXiv https://doi.org/10.48550/arXiv.1609.05807 (2017).
28. Hooker, S. Moving beyond "algorithmic bias is a data problem". Patterns 2, 100241 (2021).
29. Pfohl, S. R., Foryciarz, A. & Shah, N. H. An empirical characterization of fair machine learning for clinical risk prediction. J. Biomed. Inform. 113, 103621 (2021).
30. Kaissis, G. A., Makowski, M. R., Rückert, D. & Braren, R. F. Secure, privacy-preserving and federated machine learning in medical imaging. Nat. Mach. Intell. 2, 305–311 (2020).
31. Irvin, J. et al. CheXpert: a large chest radiograph dataset with uncertainty labels and expert comparison. In Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, 590–597 (AAAI Press, 2019).
32. Wang, X. et al. ChestX-ray8: hospital-scale chest X-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2097–2106 (IEEE, 2017).
33. Johnson, A. E. et al. MIMIC-CXR, a de-identified publicly available database of chest radiographs with free-text reports. Sci. Data 6, 1–8 (2019).
34. Pisano, E. D. et al. Diagnostic performance of digital versus film mammography for breast-cancer screening. N. Engl. J. Med. 353, 1773–1783 (2005).
35. Codella, N. et al. Skin lesion analysis toward melanoma detection 2018: a challenge hosted by the International Skin Imaging Collaboration (ISIC). Preprint at arXiv https://arxiv.org/abs/1902.03368 (2019).
36. Rotemberg, V. et al. A patient-centric dataset of images and metadata for identifying melanomas using clinical context. Sci. Data 8, 1–8 (2021).
37. Age-Related Eye Disease Study Research Group. The Age-Related Eye Disease Study (AREDS): design implications. AREDS report no. 1. Control. Clin. Trials 20, 573 (1999).
38. Petersen, S. E. et al. UK Biobank's cardiovascular magnetic resonance protocol. J. Cardiovasc. Magn. Reson. 18, 1–7 (2015).
39. Lekadir, K. et al. FUTURE-AI: guiding principles and consensus recommendations for trustworthy artificial intelligence in medical imaging. Preprint at arXiv https://arxiv.org/abs/2109.09658 (2021).
40. Wen, D. et al. Characteristics of publicly available skin cancer image datasets: a systematic review. Lancet Digit. Health 4, E64–E74 (2022).
41. Khan, S. M. et al. A global review of publicly available datasets for ophthalmological imaging: barriers to access, usability, and generalisability. Lancet Digit. Health 3, e51–e66 (2021).
42. Ibrahim, H., Liu, X., Zariffa, N., Morris, A. D. & Denniston, A. K. Health data poverty: an assailable barrier to equitable digital health care. Lancet Digit. Health 3, E260–E265 (2021).
43. Howard, F. M. et al. The impact of site-specific digital histology signatures on deep learning model accuracy and bias. Nat. Commun. 12, 1–13 (2021).
44. The All of Us Research Program Investigators. The "All of Us" Research Program. N. Engl. J. Med. 381, 668–676 (2019).
45. Ganz, M., Holm, S. H. & Feragen, A. Assessing bias in medical AI. In Workshop on Interpretable ML in Healthcare at the International Conference on Machine Learning (ICML) (2021).
46. Tomasev, N., McKee, K. R., Kay, J. & Mohamed, S. Fairness for unobserved characteristics: insights from technological impacts on queer communities. In Proceedings of the 2021 AAAI/ACM Conference on AI, Ethics, and Society, AIES '21, 254–265 (Association for Computing Machinery, 2021). https://doi.org/10.1145/3461702.3462540.
47. Flanagin, A., Frey, T., Christiansen, S. L. & AMA Manual of Style Committee. Updated guidance on the reporting of race and ethnicity in medical and science journals. JAMA 326, 621–627 (2021).
48. Itri, J. N. & Patel, S. H. Heuristics and cognitive error in medical imaging. Am. J. Roentgenol. 210, 1097–1105 (2018).
49. Sun, X., Yang, J., Sun, M. & Wang, K. A benchmark for automatic visual classification of clinical skin disease images. In European Conference on Computer Vision, 206–222 (Springer, 2016).
50. Cuadros, J. & Bresnick, G. EyePACS: an adaptable telemedicine system for diabetic retinopathy screening. J. Diabetes Sci. Technol. 3, 509–516 (2009).
Acknowledgments
We thank the Fundar foundation for supporting M.A.R.L. with a FunDatos Scholarship and the Program for Artificial Intelligence in Health at Hospital Italiano de Buenos Aires for providing the space to discuss and work on these issues. This work was supported by Argentina's National Scientific and Technical Research Council (CONICET), who covered the salaries of R.E. and E.F. The work of E.F. was partially supported by the ARPH.AI project funded by a grant (Number 109584) from the International Development Research Center (IDRC) and the Swedish International Development Cooperation Agency (SIDA). We also acknowledge the support of Universidad Nacional del Litoral (Grants CAID-PIC-50220140100084LI, 50620190100145LI), Agencia Nacional de Promoción de la Investigación, el Desarrollo Tecnológico y la Innovación (Grants PICT 2018-3907, PRH 2017-0003) and Santa Fe Agency for Science, Technology and Innovation (Award ID: IO-138-19).
Author contributions
E.F. provided the initial concept for this article, which was further developed by all authors. M.A.R.L. conducted the literature search and performed the systematic analysis across areas of application, methods, as well as strengths and vacancies. M.A.R.L. and R.E. produced the figures. R.E. and E.F. supervised the analysis. All authors wrote the paper.
Competing interests
The authors declare no competing interests.
Additional information
Correspondence and requests for materials should be addressed to
María Agustina Ricci Lara, Rodrigo Echeveste or Enzo Ferrante.
Peer review information Nature Communications thanks Jakob Kather, Judy Wawira Gichoya and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.
Reprints and permission information is available at
http://www.nature.com/reprints
Publisher's note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.
© The Author(s) 2022
... Another essential consideration for the deployment of these systems in real-world environments is to ensure their fairness, particularly by avoiding significant performance disparities between protected subgroups, such as different gender and race [5,6]. Despite growing awareness and research into the fairness of medical AI, current studies predominantly focus on singlemodality systems, especially those relying solely on imaging data for diagnosis [7][8][9][10][11][12][13][14]. In contrast, multimodal medical AI integrates various types of data, including clinical histories, patient records, laboratory results, and medical images, reflecting the diagnostic practices of clinicians and offering a more comprehensive approach to disease diagnosis [15]. ...
... Although a multi-dimensional evaluation [70,71] considering multiple metrics appears comprehensive, it may lead to assessment redundancy or inconsistency. In contrast, the equal opportunity criterion is more suitable in medical settings [10]. Hence, in our fairness analysis, we adopt the equal opportunity difference (EOD) [72] metric, using the difference in Recall between two groups to measure the disparity between them. ...
Preprint
Full-text available
Multimodal medical artificial intelligence (AI) is increasingly recognized as superior to traditional single-modality approaches due to its ability to integrate diverse data sources, which align closely with clinical diagnostic processes. However, the impact of multimodal information interactions on model fairness remains unknown, leading to a critical challenge for equitable AI deployment in healthcare. Here, we extend fairness research to multimodal medical AI and leverage large-scale medical vision-language models (VLMs) to provide guidelines for building fair multimodal AI. Training on large and diverse datasets enables medical VLMs to discern variances across populations, thereby offering a more equitable insight compared to single data sources. Our analysis covers three key medical domains—dermatology, radiology, and ophthalmology—focusing on how patient metadata interacts with medical images to affect model fairness across dimensions such as gender, age, and skin tone. Our findings reveal that the indiscriminate inclusion of all metadata may negatively impact fairness for protected subgroups and show how multimodal AI utilizes demographic information in metadata to influence fairness. In addition, we conducted an in-depth analysis of how clinical attributes affect model performance and fairness, covering more than 20 different attributes in dermatology. Finally, we proposed a fairness-oriented metadata selection strategy using recent advancements in large medical VLMs to guide attribute selection. Remarkably, we found that the fairness correlations computed by the medical VLM closely align with our experimental results, which required over 500 GPU hours, demonstrating a resource-efficient approach to guide multimodal integration. Our work underscores the importance of careful metadata selection in achieving fairness in multimodal medical AI. We anticipate that our analysis will be a starting point for more sophisticated multimodal medical AI models of fairness.
... It is beneficial to utilize a benchmark dataset to evaluate the presence of bias within specific subgroups of the populations mentioned above. However, in addition to this approach, various techniques can be employed during the development and post-processing of the model to mitigate these biases [77] such as generative AI techniques to augment the training data. For instance, Burlina et al [78] demonstrated that by generating synthetic fundus images of the eye, the discrepancies between individuals with dark and light skin tones were minimized. ...
... In recent years, numerous vendors have entered the medical imaging market with AI products to assist clinicians, and even though external validation might have been performed in a limited form in some cases [92], generalizability issues persist with CE-marked or FDAcleared models, depending on the end-users clinical context. While recommendations on reducing biases exist [3,59,74,77,93], they do not provide a foolproof guarantee against it. Besides this, AI companies most often do not disclose what data were used exactly to train their models making it hard to compare the training data to the data used in the local clinical setting. ...
Article
Full-text available
Various healthcare domains have witnessed successful preliminary implementation of artificial intelligence (AI) solutions, including radiology, though limited generalizability hinders their widespread adoption. Currently, most research groups and industry have limited access to the data needed for external validation studies. The creation and accessibility of benchmark datasets to validate such solutions represents a critical step towards generalizability, for which an array of aspects ranging from preprocessing to regulatory issues and biostatistical principles come into play. In this article, the authors provide recommendations for the creation of benchmark datasets in radiology, explain current limitations in this realm, and explore potential new approaches. Clinical relevance statement Benchmark datasets, facilitating validation of AI software performance can contribute to the adoption of AI in clinical practice. Key Points Benchmark datasets are essential for the validation of AI software performance. Factors like image quality and representativeness of cases should be considered. Benchmark datasets can help adoption by increasing the trustworthiness and robustness of AI. Graphical Abstract
... In recent years, the integration of artificial intelligence (AI) techniques into medical imaging has shown promising results and may have great potential to transform the diagnostic process [1]. Among various modalities, abdominal/ pelvic ultrasound imaging provides non-invasive visualization of internal organs and structures. ...
Article
Full-text available
Background In recent years, the integration of artificial intelligence (AI) techniques into medical imaging has shown great potential to transform the diagnostic process. This review aims to provide a comprehensive overview of current state-of-the-art applications for AI in abdominal and pelvic ultrasound imaging. Methods We searched the PubMed, FDA, and ClinicalTrials.gov databases for applications of AI in abdominal and pelvic ultrasound imaging. Results A total of 128 titles were identified from the database search and were eligible for screening. After screening, 57 manuscripts were included in the final review. The main anatomical applications included multi-organ detection (n = 16, 28%), gynecology (n = 15, 26%), hepatobiliary system (n = 13, 23%), and musculoskeletal (n = 8, 14%). The main methodological applications included deep learning (n = 37, 65%), machine learning (n = 13, 23%), natural language processing (n = 5, 9%), and robots (n = 2, 4%). The majority of the studies were single-center (n = 43, 75%) and retrospective (n = 56, 98%). We identified 17 FDA approved AI ultrasound devices, with only a few being specifically used for abdominal/pelvic imaging (infertility monitoring and follicle development). Conclusion The application of AI in abdominal/pelvic ultrasound shows promising early results for disease diagnosis, monitoring, and report refinement. However, the risk of bias remains high because very few of these applications have been prospectively validated (in multi-center studies) or have received FDA clearance.
... AI also brings risks and ethical issues, such as the need to ensure fairness, meaning it should not be biased against some group or minority [221]. Bias in AI software may result from unbalanced training data. ...
Article
Full-text available
Artificial intelligence (AI), the wide spectrum of technologies aiming to give machines or computers the ability to perform human-like cognitive functions, began in the 1940s with the first abstract models of intelligent machines. Soon after, in the 1950s and 1960s, machine learning algorithms such as neural networks and decision trees ignited significant enthusiasm. More recent advancements include the refinement of learning algorithms, the development of convolutional neural networks to efficiently analyze images, and methods to synthesize new images. This renewed enthusiasm was also due to the increase in computational power with graphical processing units and the availability of large digital databases to be mined by neural networks. AI soon began to be applied in medicine, first through expert systems designed to support the clinician’s decision and later with neural networks for the detection, classification, or segmentation of malignant lesions in medical images. A recent prospective clinical trial demonstrated the non-inferiority of AI alone compared with a double reading by two radiologists on screening mammography. Natural language processing, recurrent neural networks, transformers, and generative models have both improved the capabilities of making an automated reading of medical images and moved AI to new domains, including the text analysis of electronic health records, image self-labeling, and self-reporting. The availability of open-source and free libraries, as well as powerful computing resources, has greatly facilitated the adoption of deep learning by researchers and clinicians. Key concerns surrounding AI in healthcare include the need for clinical trials to demonstrate efficacy, the perception of AI tools as ‘black boxes’ that require greater interpretability and explainability, and ethical issues related to ensuring fairness and trustworthiness in AI systems. Thanks to its versatility and impressive results, AI is one of the most promising resources for frontier research and applications in medicine, in particular for oncological applications.
... On the contrary, clinicians pay more attention to the physiological causality between the two terms, for example, will the anatomical difference between the male and the female affect the diagnosis difficulty? This different paradigm of attribute selection also brings gaps between AI scientists and clinicians when they address fairness in MedIA together 92 . ...
Article
Full-text available
Deep learning algorithms have demonstrated remarkable efficacy in various medical image analysis (MedIA) applications. However, recent research highlights a performance disparity in these algorithms when applied to specific subgroups, such as exhibiting poorer predictive performance in elderly females. Addressing this fairness issue has become a collaborative effort involving AI scientists and clinicians seeking to understand its origins and develop solutions for mitigation within MedIA. In this survey, we thoroughly examine the current advancements in addressing fairness issues in MedIA, focusing on methodological approaches. We introduce the basics of group fairness and subsequently categorize studies on fair MedIA into fairness evaluation and unfairness mitigation. Detailed methods employed in these studies are presented too. Our survey concludes with a discussion of existing challenges and opportunities in establishing a fair MedIA and healthcare system. By offering this comprehensive review, we aim to foster a shared understanding of fairness among AI researchers and clinicians, enhance the development of unfairness mitigation methods, and contribute to the creation of an equitable MedIA society.
... However, it remains true that addressing ethical issues is a prerequisite for the formation of trust relationships or avoiding a loss of trust. Ethical concerns of medical AI include issues such as biases and fairness (Ricci Lara et al. 2022), transparency and explainability (Kempt et al. 2022) but also broader questions of distributive justice (Lehoux et al. 2019). Proposed and often already implemented mitigation options include high-level development of policy, legislation and regulation (CAHAI 2022;OECD 2019), including the EU's AI Act (European Commission 2021b; The European Parliament and the Council of the EU 2024), national policy and legislation (e.g. ...
Article
Full-text available
In this article, we explore questions about the culture of trustworthy artificial intelligence (AI) through the lens of ecosystems. We draw on the European Commission’s Guidelines for Trustworthy AI and its philosophical underpinnings. Based on the latter, the trustworthiness of an AI ecosystem can be conceived of as being grounded by both the so-called rational-choice and motivation-attributing accounts—i.e., trusting is rational because solution providers deliver expected services reliably, while trust also involves resigning control by attributing one’s motivation, and hence, goals, onto another entity. Our research question is: What aspects contribute to a responsible AI ecosystem that can promote justifiable trustworthiness in a healthcare environment? We argue that especially within devising governance and support aspects of a medical AI ecosystem, considering the so-called motivation-attributing account of trust provides fruitful pointers. There can and should be specific ways and governance structures supporting and nurturing trustworthiness beyond mere reliability. After compiling a list of preliminary requirements for this, we describe the emergence of one particular medical AI ecosystem and assess its compliance with and future ways of improving its functioning as a responsible AI ecosystem that promotes trustworthiness.
Article
Full-text available
Should the input data of artificial intelligence (AI) systems include factors such as race or sex when these factors may be indicative of morally significant facts? More importantly, is it wrong to rely on the output of AI tools whose input includes factors such as race or sex? And is it wrong to rely on the output of AI systems when it is correlated with factors such as race or sex (whether or not its input includes such factors)? The answers to these questions are controversial. In this paper, I argue for the following claims. First, since factors such as race or sex are not morally significant in themselves, including such factors in the input data, or relying on output that includes such factors or is correlated with them, is neither objectionable (for example, unfair) nor commendable in itself. Second, sometimes (but not always) there are derivative reasons against such actions due to the relationship between factors such as race or sex and facts that are morally significant (ultimately) in themselves. Finally, even if there are such derivative reasons, they are not necessarily decisive since there are sometimes also countervailing reasons. Accordingly, the moral status of the above actions is contingent.
Article
Full-text available
Questions of unfairness and inequity pose critical challenges to the successful deployment of artificial intelligence (AI) in healthcare settings. In AI models, unequal performance across protected groups may be partially attributable to the learning of spurious or otherwise undesirable correlations between sensitive attributes and disease-related information. Here, we introduce the Attribute Neutral Framework, designed to disentangle biased attributes from disease-relevant information and subsequently neutralize them to improve representation across diverse subgroups. Within the framework, we develop the Attribute Neutralizer (AttrNzr) to generate neutralized data, for which protected attributes can no longer be easily predicted by humans or by machine learning classifiers. We then utilize these data to train the disease diagnosis model (DDM). Comparative analysis with other unfairness mitigation algorithms demonstrates that AttrNzr outperforms in reducing the unfairness of the DDM while maintaining DDM’s overall disease diagnosis performance. Furthermore, AttrNzr supports the simultaneous neutralization of multiple attributes and demonstrates utility even when applied solely during the training phase, without being used in the test phase. Moreover, instead of introducing additional constraints to the DDM, the AttrNzr directly addresses a root cause of unfairness, providing a model-independent solution. Our results with AttrNzr highlight the potential of data-centered and model-independent solutions for fairness challenges in AI-enabled medical systems.
Article
Full-text available
Background Previous studies in medical imaging have shown disparate abilities of artificial intelligence (AI) to detect a person's race, yet there is no known correlation for race on medical imaging that would be obvious to human experts when interpreting the images. We aimed to conduct a comprehensive evaluation of the ability of AI to recognise a patient's racial identity from medical images. Methods Using private (Emory CXR, Emory Chest CT, Emory Cervical Spine, and Emory Mammogram) and public (MIMIC-CXR, CheXpert, National Lung Cancer Screening Trial, RSNA Pulmonary Embolism CT, and Digital Hand Atlas) datasets, we evaluated, first, performance quantification of deep learning models in detecting race from medical images, including the ability of these models to generalise to external environments and across multiple imaging modalities. Second, we assessed possible confounding of anatomic and phenotypic population features by assessing the ability of these hypothesised confounders to detect race in isolation using regression models, and by re-evaluating the deep learning models by testing them on datasets stratified by these hypothesised confounding variables. Last, by exploring the effect of image corruptions on model performance, we investigated the underlying mechanism by which AI models can recognise race. Findings In our study, we show that standard AI deep learning models can be trained to predict race from medical images with high performance across multiple imaging modalities, which was sustained under external validation conditions (x-ray imaging [area under the receiver operating characteristics curve (AUC) range 0·91–0·99], CT chest imaging [0·87–0·96], and mammography [0·81]). We also showed that this detection is not due to proxies or imaging-related surrogate covariates for race (eg, performance of possible confounders: body-mass index [AUC 0·55], disease distribution [0·61], and breast density [0·61]). Finally, we provide evidence to show that the ability of AI deep learning models persisted over all anatomical regions and frequency spectrums of the images, suggesting the efforts to control this behaviour when it is undesirable will be challenging and demand further study. Interpretation The results from our study emphasise that the ability of AI deep learning models to predict self-reported race is itself not the issue of importance. However, our finding that AI can accurately predict self-reported race, even from corrupted, cropped, and noised medical images, often when clinical experts cannot, creates an enormous risk for all model deployments in medical imaging. Funding National Institute of Biomedical Imaging and Bioengineering, MIDRC grant of National Institutes of Health, US National Science Foundation, National Library of Medicine of the National Institutes of Health, and Taiwan Ministry of Science and Technology
Article
We propose a novel method for enforcing AI fairness with respect to protected or sensitive factors. This method uses a dual strategy performing training and representation alteration (TARA) for the mitigation of prominent causes of AI bias. It includes the use of representation learning alteration via adversarial independence to suppress the bias-inducing dependence of the data representation from protected factors and training set alteration via intelligent augmentation to address bias-causing data imbalance by using generative models that allow the fine control of sensitive factors related to underrepresented populations via domain adaptation and latent space manipulation. When testing our methods on image analytics, experiments demonstrate that TARA significantly or fully debiases baseline models while outperforming competing debiasing methods that have the same amount of information—for example, with (% overall accuracy, % accuracy gap) = (78.8, 0.5) versus the baseline method's score of (71.8, 10.5) for Eye-PACS, and (73.7, 11.8) versus (69.1, 21.7) for CelebA. Furthermore, recognizing certain limitations in current metrics used for assessing debiasing performance, we propose novel conjunctive debiasing metrics. Our experiments also demonstrate the ability of these novel metrics in assessing the Pareto efficiency of the proposed methods.
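The (% overall accuracy, % accuracy gap) pairs quoted above combine a utility score and a disparity score in one tuple. A minimal sketch of how such a pair can be computed from predictions and group labels is given below; the data are synthetic, and the gap is taken as the max-minus-min difference in per-group accuracy, which is one common convention rather than necessarily the exact metric used by the authors.

```python
# Sketch of an (overall accuracy, accuracy gap) summary on synthetic data.
import numpy as np

def acc_and_gap(y_true, y_pred, group):
    """Return (overall accuracy %, max-min per-group accuracy gap %)."""
    overall = np.mean(y_true == y_pred) * 100
    per_group = [np.mean(y_true[group == g] == y_pred[group == g]) * 100
                 for g in np.unique(group)]
    return overall, max(per_group) - min(per_group)

rng = np.random.default_rng(2)
y_true = rng.integers(0, 2, 500)
group = rng.integers(0, 2, 500)
# A biased toy predictor: more errors in group 1 than in group 0.
flip = (group == 1) & (rng.random(500) < 0.3) | (group == 0) & (rng.random(500) < 0.05)
y_pred = np.where(flip, 1 - y_true, y_true)

print("(accuracy %%, gap %%) = (%.1f, %.1f)" % acc_and_gap(y_true, y_pred, group))
```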
Article
Artificial intelligence (AI) systems have increasingly achieved expert-level performance in medical imaging applications. However, there is growing concern that such AI systems may reflect and amplify human bias, and reduce the quality of their performance in historically under-served populations such as female patients, Black patients, or patients of low socioeconomic status. Such biases are especially troubling in the context of underdiagnosis, whereby the AI algorithm would inaccurately label an individual with a disease as healthy, potentially delaying access to care. Here, we examine algorithmic underdiagnosis in chest X-ray pathology classification across three large chest X-ray datasets, as well as one multi-source dataset. We find that classifiers produced using state-of-the-art computer vision techniques consistently and selectively underdiagnosed under-served patient populations and that the underdiagnosis rate was higher for intersectional under-served subpopulations, for example, Hispanic female patients. Deployment of AI systems using medical imaging for disease diagnosis with such biases risks exacerbation of existing care biases and can potentially lead to unequal access to medical treatment, thereby raising ethical concerns for the use of these models in the clinic.
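Underdiagnosis as described here corresponds to the false-negative rate (diseased patients labelled healthy), compared across subgroups. A hedged sketch on synthetic classifier output and patient metadata:

```python
# Sketch of an underdiagnosis audit: false-negative rate per subgroup.
# Arrays are synthetic stand-ins for classifier output and patient metadata.
import numpy as np

def fnr(y_true, y_pred):
    """False-negative rate: fraction of true positives predicted as negative."""
    pos = y_true == 1
    return np.mean(y_pred[pos] == 0) if pos.any() else float("nan")

rng = np.random.default_rng(3)
y_true = rng.integers(0, 2, 1000)
sex = rng.choice(["F", "M"], size=1000)
# Simulated classifier that misses disease more often for one subgroup.
miss = (sex == "F") & (rng.random(1000) < 0.25) | (sex == "M") & (rng.random(1000) < 0.10)
y_pred = np.where((y_true == 1) & miss, 0, y_true)

for g in ["F", "M"]:
    print(f"underdiagnosis (FNR) for {g}: {fnr(y_true[sex == g], y_pred[sex == g]):.2f}")
```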
Preprint
The use of artificial intelligence (AI) in healthcare has become a very active research area in recent years. While significant progress has been made in image classification tasks, only a few AI methods are actually deployed in hospitals. A major current hurdle to the clinical use of AI models is their trustworthiness. More often than not, these complex models are black boxes that produce promising results, but when scrutinized they reveal implicit biases in their decision making, such as detecting race and exhibiting bias against ethnic groups and subpopulations. In our ongoing study, we develop a two-step adversarial debiasing approach with partial learning that can reduce racial disparity while preserving performance on the targeted task. The methodology has been evaluated on two independent medical imaging case studies (chest X-ray and mammography) and showed promise in bias reduction while preserving the targeted performance.
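Adversarial debiasing of this kind is often built around a gradient-reversal layer: an adversary is trained to predict the protected attribute from the shared representation, while the reversed gradient pushes the encoder to hide that attribute. The sketch below shows that generic mechanism in PyTorch on toy tensors; it is not the authors' two-step partial-learning procedure, and all shapes, heads and hyperparameters are illustrative assumptions.

```python
# Generic adversarial-debiasing sketch with a gradient-reversal layer.
# Illustrative only; not the two-step partial-learning method of the preprint.
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lamb):
        ctx.lamb = lamb
        return x.view_as(x)          # identity in the forward pass

    @staticmethod
    def backward(ctx, grad_out):
        return -ctx.lamb * grad_out, None   # reversed gradient for the encoder

encoder = nn.Sequential(nn.Linear(64, 32), nn.ReLU())
task_head = nn.Linear(32, 2)         # disease prediction
adv_head = nn.Linear(32, 2)          # protected-attribute prediction (adversary)
opt = torch.optim.Adam([*encoder.parameters(), *task_head.parameters(),
                        *adv_head.parameters()], lr=1e-3)
ce = nn.CrossEntropyLoss()

x = torch.randn(128, 64)                      # toy features
y_disease = torch.randint(0, 2, (128,))
y_attr = torch.randint(0, 2, (128,))

for step in range(100):
    z = encoder(x)
    # Adversary learns to predict the attribute; encoder learns to hide it.
    loss = ce(task_head(z), y_disease) \
         + ce(adv_head(GradReverse.apply(z, 1.0)), y_attr)
    opt.zero_grad()
    loss.backward()
    opt.step()
```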
Article
Publicly available skin image datasets are increasingly used to develop machine learning algorithms for skin cancer diagnosis. However, the total number of datasets and their respective content is currently unclear. This systematic review aimed to identify and evaluate all publicly available skin image datasets used for skin cancer diagnosis by exploring their characteristics, data access requirements, and associated image metadata. A combined MEDLINE, Google, and Google Dataset search identified 21 open access datasets containing 106 950 skin lesion images, 17 open access atlases, eight regulated access datasets, and three regulated access atlases. Images and accompanying data from open access datasets were evaluated by two independent reviewers. Among the 14 datasets that reported country of origin, most (11 [79%]) originated from Europe, North America, and Oceania exclusively. Most datasets (19 [91%]) contained dermoscopic images or macroscopic photographs only. Clinical information was available regarding age for 81 662 images (76·4%), sex for 82 848 (77·5%), and body site for 79 561 (74·4%). Subject ethnicity data were available for 1415 images (1·3%), and Fitzpatrick skin type data for 2236 (2·1%). There was limited and variable reporting of characteristics and metadata among datasets, with substantial under-representation of darker skin types. This is the first systematic review to characterise publicly available skin image datasets, highlighting limited applicability to real-life clinical settings and restricted population representation, precluding generalisability. Quality standards for characteristics and metadata reporting for skin image datasets are needed.
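The metadata-availability percentages reported above amount to counting, for each field, the fraction of images with a non-missing value. A minimal sketch with a tiny synthetic metadata table (column names are assumptions, not the review's schema):

```python
# Sketch of a metadata-completeness audit over pooled dataset metadata.
import pandas as pd

meta = pd.DataFrame({
    "age": [54, None, 67, 41, None],
    "sex": ["F", "M", None, "F", "M"],
    "body_site": ["back", None, None, "arm", "leg"],
    "fitzpatrick_type": [None, None, "II", None, None],
})
# Percentage of images with each metadata field recorded.
completeness = meta.notna().mean().mul(100).round(1)
print(completeness.to_string())
```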
Preprint
It has been rightfully emphasized that the use of AI for clinical decision making could amplify health disparities. A machine learning model may pick up undesirable correlations, for example, between a patient's racial identity and clinical outcome. Such correlations are often present in (historical) data used for model development. There has been an increase in studies reporting biases in disease detection models across patient subgroups. Besides the scarcity of data from underserved populations, very little is known about how these biases are encoded and how one may reduce or even remove disparate performance. There is some speculation about whether algorithms may recognize patient characteristics such as biological sex or racial identity, and then directly or indirectly use this information when making predictions, but it remains unclear how we can establish whether such information is actually used. This article aims to shed some light on these issues by exploring new methodology that allows intuitive inspection of the inner workings of machine learning models for image-based detection of disease. We also evaluate an effective yet debatable technique for addressing disparities that leverages the automatic prediction of patient characteristics, resulting in models with comparable true and false positive rates across subgroups. Our findings may stimulate the discussion about the safe and ethical use of AI.
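One simple, hedged instantiation of "leveraging the automatic prediction of patient characteristics" is to reweight training samples by the inverse frequency of the predicted subgroup so that each group contributes equally to the loss; the sketch below illustrates that idea and is not necessarily the technique evaluated in the preprint.

```python
# Inverse-frequency reweighting by (possibly model-predicted) subgroup.
# Illustrative sketch on synthetic, imbalanced group labels.
import numpy as np

rng = np.random.default_rng(4)
predicted_group = rng.choice(["A", "B"], size=1000, p=[0.9, 0.1])  # imbalanced
groups, counts = np.unique(predicted_group, return_counts=True)
inv_freq = {g: len(predicted_group) / (len(groups) * c) for g, c in zip(groups, counts)}
sample_weights = np.array([inv_freq[g] for g in predicted_group])
# sample_weights can then feed a weighted loss or a weighted data sampler.
print({g: round(inv_freq[g], 2) for g in groups})
```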
Chapter
The subject of ‘fairness’ in artificial intelligence (AI) refers to assessing AI algorithms for potential bias based on demographic characteristics such as race and gender, and the development of algorithms to address this bias. Most applications to date have been in computer vision, although some work in healthcare has started to emerge. The use of deep learning (DL) in cardiac MR segmentation has led to impressive results in recent years, and such techniques are starting to be translated into clinical practice. However, no work has yet investigated the fairness of such models. In this work, we perform such an analysis for racial/gender groups, focusing on the problem of training data imbalance, using an nnU-Net model trained and evaluated on cine short-axis cardiac MR data from the UK Biobank dataset, consisting of 5,903 subjects from six different racial groups. We find statistically significant differences in Dice performance between different racial groups. To reduce the racial bias, we investigated three strategies: (1) stratified batch sampling, in which batch sampling is stratified to ensure balance between racial groups; (2) fair meta-learning for segmentation, in which a DL classifier is trained to classify race and jointly optimized with the segmentation model; and (3) protected group models, in which a different segmentation model is trained for each racial group. We also compared the results to the scenario where we have a perfectly balanced database. To assess fairness, we used the standard deviation (SD) and skewed error ratio (SER) of the average Dice values. Our results demonstrate that the racial bias results from the use of imbalanced training data, and that all proposed bias mitigation strategies improved fairness, with the best SD and SER resulting from the use of protected group models.
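The two fairness summaries named here can be computed directly from per-group mean Dice scores. The sketch below assumes the common definitions of SD as the standard deviation of group-wise Dice and SER as the ratio of the largest to the smallest per-group error (1 - Dice); the Dice values are synthetic placeholders rather than the study's results.

```python
# Sketch of SD and SER fairness summaries from per-group mean Dice scores,
# under the common definitions stated above. Values are synthetic.
import numpy as np

group_dice = {"group_1": 0.93, "group_2": 0.91, "group_3": 0.86,
              "group_4": 0.92, "group_5": 0.88, "group_6": 0.90}
dice = np.array(list(group_dice.values()))
errors = 1.0 - dice                       # per-group segmentation error
sd = dice.std()                           # spread of group-wise Dice
ser = errors.max() / errors.min()         # skewed error ratio
print(f"SD of group Dice: {sd:.3f}, skewed error ratio (SER): {ser:.2f}")
```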