All content in this area was uploaded by Adrian Gabriel Zucco on Nov 25, 2022
UNIVERSITY OF COPENHAGEN
FACULTY OF HEALTH AND MEDICAL SCIENCES
Explainable Machine Learning for
Precision Medicine of
Patients with Infectious Diseases
PhD Thesis
Adrian Gabriel Zucco, MSc
This thesis has been submitted to the Graduate School of Health and Medical Sciences,
University of Copenhagen on the 5th of August 2022.
” Strømmen fanger dig.
Svøm MED strømmen – aldrig MOD strømmen,
som kan træke dig ned.
Når strømmen bliver svagere -
Svøm rundt i en bue tilbage mod land.”
” The current catches you.
Swim WITH the current – never AGAINST the current
as it may pull you down.
When the current becomes weaker,
Swim towards the shore in a curve.”
- Inspirational sign at a beach in Nationalpark Vadehavet
Studies
This thesis is based on two studies, referred to by the Roman numerals I and II.
I.
Associations of functional HLA class I groups with HIV viral load in a
heterogeneous cohort.1
Adrian G. Zucco, Marc Bennedbæk, Christina Ekenberg, Migle Gabrielaite,
Preston Leung, Mark N. Polizzotto, Virginia Kan, Daniel D. Murray, Jens D.
Lundgren and Cameron R. MacPherson for the INSIGHT START study group.
medRxiv, June 2022.
(Submitted to AIDS)
II.
Personalized survival probabilities for SARS-CoV-2 positive patients by
explainable machine learning.2
Adrian G. Zucco, Rudi Agius, Rebecka Svanberg, Kasper S. Moestrup, Ramtin Z.
Marandi, Cameron Ross MacPherson, Jens Lundgren, Sisse R. Ostrowski*,
Carsten U. Niemann*
*Co-senior authors.
medRxiv, Oct. 2021.
(Accepted at Scientific Reports)
Correspondence
Adrian Gabriel Zucco, MSc.
CHIP, Centre of Excellence for Health Immunity and Infections,
PERSIMUNE, Centre of Excellence, Rigshospitalet, University of Copenhagen, Denmark
Telephone: +45 35 45 57 75
e-mail: adrian.gabriel.zucco@regionh.dk
Financial support
This PhD has been supported by the Danish National Research Foundation (DNRF126) and
a COVID-19 grant from the Ministry of Higher Education and Science (0238-00006B) while
conducting research at CHIP (Centre of Excellence for Health, Immunity and Infections) and
PERSIMUNE (Centre of Excellence for personalized medicine of infectious complications in
immune deficiency).
Relevant scientific contributions
The research presented in this thesis has been an important contribution to the following
studies:
Association Between Single-Nucleotide Polymorphisms in HLA Alleles and Human
Immunodeficiency Virus Type 1 Viral Load in Demographically Diverse, Antiretroviral
Therapy–Naive Participants from the Strategic Timing of AntiRetroviral Treatment Trial.3
Ekenberg, C., Tang, M.-H., Zucco, A. G., Murray, D. D., MacPherson, C. R., Hu, X.,
Sherman, B. T., Losso, M. H., Wood, R., Paredes, R., Molina, J.-M., Helleberg, M., Jina, N.,
Kityo, C. M., Florence, E., Polizzotto, M. N., Neaton, J. D., Lane, H. C., & Lundgren, J. D.
The Journal of Infectious Diseases, 2019, 220(8), 1325–1334.
The association of human leukocyte antigen alleles with clinical disease progression in HIV-
positive cohorts with varied treatment strategies.4
Ekenberg, C., Reekie, J., Zucco, A. G., Murray, D. D., Sharma, S., Macpherson, C. R.,
Babiker, A., Kan, V., Lane, H. C., Neaton, J. D., Lundgren, J. D., & for the INSIGHT START,
S. S. G.
AIDS, 2021, 35(5), 783–789.
Human Immunotypes Impose Selection on Viral Genotypes Through Viral Epitope
Specificity.5
Gabrielaite, M.*, Bennedbæk, M.*, Zucco, A. G., Ekenberg, C., Murray, D. D., Kan, V. L.,
Touloumi, G., Vandekerckhove, L., Turner, D., Neaton, J., Lane, H. C., Safo, S., Arenas-
Pinto, A., Polizzotto, M. N., Günthard, H. F., Lundgren, J. D., Marvig, R. L., & INSIGHT START
Study Group.
*Contributed equally
The Journal of Infectious Diseases, 2021, 224(12), 2053–2063.
Readmissions, post-discharge mortality and sustained recovery among patients admitted to
hospital with COVID-19.6
Moestrup, K. S., Reekie, J., Zucco, A. G., Jensen, T. Ø., Jensen, J.-U. S., Wiese, L.,
Ostrowski, S. R., Niemann, C. U., MacPherson, C. R., Lundgren, J., & Helleberg, M.
Clinical Infectious Diseases, 2022 (in press).
Supervisors
Supervisor:
Jens D. Lundgren, Clinical Professor, MD, DMSc.
CHIP, Centre of Excellence for Health Immunity and
Infections, PERSIMUNE, Centre of Excellence,
Rigshospitalet, University of Copenhagen, Denmark.
Co-supervisor:
Ole Winther, Professor, MSc., PhD.
Department of Biology, Bioinformatics Centre, University
of Copenhagen, Denmark.
Center for Genomic Medicine, Rigshospitalet,
Copenhagen University Hospital, Copenhagen, Denmark.
Section for Cognitive Systems, Department of Applied
Mathematics and Computer Science, Technical University
of Denmark, Kongens Lyngby, Denmark.
Co-supervisor:
Cameron R. MacPherson, MSc., PhD.
CHIP, Centre of Excellence for Health Immunity and
Infections, PERSIMUNE, Centre of Excellence,
Rigshospitalet, University of Copenhagen, Denmark.
Assessment committee
Chair
Ole Kirk, Associate professor, MD, DMSc.
Department of Infectious Diseases,
Rigshospitalet, University of Copenhagen, Denmark
Assessor
Anders Gorm Pedersen, Professor, MSc., PhD.
Department of Health Technology,
Danmarks Tekniske Universitet, Denmark
Assessor
Girish N. Nadkarni, Associate professor, MSc., MD., PhD.
Icahn School of Medicine at Mount Sinai, United States
Table of contents
1. PREFACE AND ACKNOWLEDGEMENTS
2. LIST OF ABBREVIATIONS
3. SUMMARY
3.1. ENGLISH SUMMARY
3.2. DANSK RESUMÉ
4. INTRODUCTION
4.1. THE NEW PARADIGM OF PRECISION MEDICINE
4.2. INFERENCE AND PREDICTION IN PRECISION MEDICINE
4.3. ARTIFICIAL INTELLIGENCE, MACHINE LEARNING AND DEEP LEARNING
4.4. TRANSPARENCY, INTERPRETABILITY AND EXPLAINABILITY OF MACHINE LEARNING MODELS
4.5. ARTIFICIAL INTELLIGENCE FOR PATIENTS WITH INFECTIOUS DISEASES
4.6. PREDICTIVE MODELS DURING THE SARS-COV-2 PANDEMIC
4.7. PRECISION MEDICINE OF PATIENTS WITH INFECTIOUS DISEASES
4.8. HOST GENOMICS IN HIV INFECTION
5. OBJECTIVES
6. METHODS
6.1. DATA SOURCES
6.1.1. THE STRATEGIC TIMING OF ANTIRETROVIRAL TREATMENT (START) COHORT
6.1.2. ELECTRONIC HEALTH RECORDS IN THE CONTEXT OF THE SARS-COV-2 PANDEMIC
6.2. ETHICAL CONSIDERATIONS
6.3. ENCODING ELECTRONIC HEALTH RECORDS FOR MACHINE LEARNING
6.4. BIOINFORMATICS APPROACHES FOR VIRAL GENOMICS
6.5. SUPERVISED MACHINE LEARNING
6.5.1. HUMAN LEUCOCYTE ANTIGEN IMPUTATION
6.5.2. BINDING AFFINITY PREDICTION OF HLA CLASS I ALLELES TO HIV PEPTIDES
6.5.3. SURVIVAL ANALYSIS BY DISCRETE-TIME MODELLING
6.6. UNSUPERVISED MACHINE LEARNING
6.6.1. CONSENSUS CLUSTERING
6.7. TRAINING AND ASSESSMENT OF MACHINE LEARNING MODELS
6.8. EXPLAINABILITY OF MACHINE LEARNING MODELS
6.9. STATISTICAL ANALYSES
6.10. SOFTWARE AND VISUALIZATION TOOLS
7. SUMMARY OF RESULTS
7.1. STUDY I
7.2. STUDY II
8. DISCUSSION
8.1. IMPUTATION AND FUNCTIONAL CLUSTERING OF HLA ALLELES BY MACHINE LEARNING MODELS
8.1.1. ASSOCIATIONS OF FUNCTIONAL HLA CLASS I CLUSTERS WITH HIV VIRAL LOAD
8.2. EXPLAINABLE MACHINE LEARNING FOR SURVIVAL MODELS IN PRECISION MEDICINE
8.2.1. RISK FACTORS IN SARS-COV-2+ PATIENTS THROUGH MODEL EXPLAINABILITY
8.3. STRENGTHS AND LIMITATIONS
9. CONCLUSION
10. FUTURE PERSPECTIVES
10.1. HOW TO MODEL: FROM PREDICTIVE TO CAUSAL MODELS
10.2. WHAT TO MODEL: DATA-DRIVEN MODELS AND COMPLEXITY
11. REFERENCES
12. MANUSCRIPTS
1. Preface and acknowledgements
Back in early 2018, my interest in applying what I had learned about Machine Learning to
understand our immune system and contribute to the fight against infectious diseases led me
to contact Jens Lundgren. He not only accepted me at CHIP and PERSIMUNE but has, since that
day, supported my endeavours with balanced scientific scrutiny, guiding me through the
nuances of the medical world and pushing me to become a better clinical researcher. For
doing so while always keeping a personal touch, I am immensely grateful and inspired.
I would also like to express my gratitude to my co-supervisors. Thanks to Cameron MacPherson
for his support and understanding as a fellow bioinformatician and curious being. Also, thanks
to Ole Winther for his sharp advice and pragmatic guidance when engaging in stimulating
scientific discussions. Our shared interest in bringing Machine Learning to the clinic has been a
motivation. A special acknowledgement goes to Daniel Murray: his support, help and encouragement
to write have been so pivotal that I consider him my unofficial co-supervisor.
This thesis would not have been possible without nurturing collaborations. Thanks to Christina
Ekenberg for her kindness and fruitful work together, Kasper Moestrup for always being there
side by side, Rudi Agius for shared moments and passion for our crafts. Thanks for the good
collaboration across departments with Rebecka Svanberg, Sisse Ostrowski, Carsten Niemann,
Marie Helleberg and Rasmus Marvig. Also, to the international collaborators from INSIGHT and
welcoming researchers in the translational immunology group at Institut Pasteur. My
appreciation also goes to the participants of the studies for their contribution.
I would like to say thanks to my fellow bioinformaticians Migle Gabrielaite, Mette Jørgensen,
Preston Leung, Ramtin Marandi, Kirstine Krøyer Rasmussen, Pernille Iversen and Man-Hung
Tang for their camaraderie and nice discussions. Also, to the fellow statisticians, Joanne Reekie,
Erich Tusch and Quenia dos Santos. Moreover, thanks to Jens Christian and Marc Bennedbæk
for contributing to the Ph. in my PhD.
A huge thanks to Lisbeth Jørgensen, Helle Bo Duus and Lisbeth Bille for making my life much
easier with their help. I would like to acknowledge the talented team at PERSIMUNE and CHIP
with whom I had the honour to interact over the years: Dorthe Raben, Alvaro Borges, Bastian
Neesgaard, Cynthia Terrones Campos, Emma Illett, Isabelle Lodding, Cornelia Crone, Sara
Mørup, Sebastian Moretto, Christian Jensen, Riia Sustarsic, Jamshed Gill, Olga Fursa, Lars
Peters, Nadine Jaschinski, Frederik Woldbye... The list continues but I would also like to give
kudos to the IT department and public health teams for their work, with a special mention to
Tina Bruun for her joy and Anne Raahauge for the interesting interdisciplinary discussions.
On a personal level, I would not have reached far without the company, fun moments, and
therapeutic conversations with my dear friends Angélica, Jakub, Zulema, Maria Luisa, Rocío,
my Sydhavn crew including Jon, my beloved biochemists, my high school brothers, and friends
Emma and Lucy.
Last but not least, I want to thank the love and support of my wonderful partner Rasa, my
parents (Danilo and Miriam) and my family in Argentina, especially my grandmother and
godmother Sandra. My academic achievements are also theirs.
2. List of abbreviations
Abbreviation  Description
ACE2          Angiotensin converting enzyme 2
AI            Artificial Intelligence
AIDS          Acquired immunodeficiency syndrome
ART           Antiretroviral therapy
ATC           Anatomical Therapeutic Chemical
AUROC         Area under the receiver operating characteristic
CCR5          C-C chemokine receptor type 5
CD4           Cluster of differentiation 4
CI            Confidence interval
COVID-19      Coronavirus disease 2019
ESS           Error sum-of-squares
FPT           First positive SARS-CoV-2 test
HIV           Human Immunodeficiency Virus
HLA           Human Leucocyte Antigen
ICD-10        International Statistical Classification of Diseases and Related Health Problems version 10
ICU           Intensive care unit
IML           Interpretable Machine Learning
INSIGHT       International Network for Strategic Initiatives in Global HIV Trials
KIR           Killer immunoglobulin-like receptors
mAI           Medical Artificial Intelligence
MCC           Matthews correlation coefficient
MHC           Major histocompatibility complex
NLP           Natural language processing
PCA           Principal Component Analysis
PDP           Partial dependence plot
PLWH          People living with HIV
PR-AUC        Precision-recall area under the curve
RNA           Ribonucleic acid
RT-PCR        Real-Time Polymerase Chain Reaction
RWD           Real-world data
SARS-CoV-2    Severe acute respiratory syndrome coronavirus 2
SHAP          SHapley Additive exPlanations
SNP           Single Nucleotide Polymorphism
START         Strategic Timing of AntiRetroviral Treatment
VL            Viral load
WHO           World Health Organization
3. Summary
3.1. English summary
Precision medicine is developing as a new paradigm in healthcare. To achieve its goals,
refined characterizations of patients and personalized clinical models are necessary.
Progress has been made in these areas by the inference of relevant genomic factors and
the proposal of better predictive models for clinical outcomes. This has been possible
through the development of statistics and computer science, in particular Machine Learning,
providing the tools for modelling the complexity in precision medicine. Nevertheless, to move
beyond purely predictive models, and gain insights from Machine Learning we need to
consider approaches to explain such models. This is especially relevant in the context of
infectious diseases, where the identification of patients at risk can improve clinical outcomes
and help understand critical disease mechanisms. An example of this has been observed
during the HIV pandemic in which some host genetic factors, in particular the Human
Leucocyte Antigen (HLA), have been linked to variation in HIV viral load. However, studies
of the HLA region have been challenging in heterogeneous populations due to its diversity.
More recently, during the SARS-CoV-2 pandemic, healthcare systems have been strained,
highlighting the need for early and precise patient risk assessment. In this thesis, Machine
Learning solutions to key problems in precision medicine of patients with infectious diseases
are presented in the context of HIV and SARS-CoV-2 infections powered by model
explainability.
In Study I, we investigated associations of HLA alleles to HIV viral load (VL) in a genetically
diverse cohort of 2546 HIV+ participants from the Strategic Timing of AntiRetroviral
Treatment (START) study by imputing and accounting for functional relationships between
HLA alleles. Machine Learning methods were used to impute and cluster HLA alleles based
on their predicted binding affinities to HIV and unspecific peptides. We found four major
functional clusters representing 30 HLA alleles accounting for over 11% of the variability in
VL previously reported in homogeneous cohorts. Some of these alleles, while present in
distinct populations, shared a common function reflecting similar effects. These effects were
also found using unspecific peptides; hence, the proposed methodology could be used for the
study of HLA alleles in other diseases.
In Study II, we developed a Machine Learning model to predict mortality within 12 weeks of
a first positive SARS-CoV-2 test based on 33938 cases during the first year of the COVID-
19 pandemic in Denmark. By implementing a discrete-time modelling approach leveraging
electronic health records into models with high performance, we could predict and explain
personalized survival curves and temporal dynamics. Among the final 22 features, age, sex,
number of medications, previous hospitalizations and lymphocyte counts were the top risk
factors for mortality. Compared to previous models developed for COVID-19, we account for
censored patients and missing values while providing explainable predictions that are
patient-specific. The proposed methodology can be used in other clinical problems as a
framework for predictive models in precision medicine.
3.2. Dansk resumé
Præcisionsmedicin er et nyt paradigme indenfor sundhedsbehandling, der forudsætter en
detaljeret karakterisering af patienter og kliniske modeller tilpasset den enkelte patient.
Prædiktive modeller for kliniske resultater. Forudsigelser af relevante genomiske faktorer og
bedre prædiktive modeller har været afgørende for den opnåede fremskridt. Dette har været
muligt gennem udviklingen af statistiske og datalogiske metoder, især maskinlæring, der gør
det muligt at modellere kompleksiteten i data. For at komme ud over rent prædiktive modeller
er vi udvikle metoder til at fortolke modellerne og deres forudsigelser. Dette er især relevant i
forbindelse med infektionssygdomme, hvor identifikation af patienter i risikogrupper, kan
forbedre kliniske resultater og hjælpe med at forstå kritiske sygdomsmekanismer. Et eksempel
på dette er blevet observeret under HIV-pandemien, hvor nogle genetiske værtsfaktorer, især
Human Leucocyte Antigen (HLA), er blevet forbundet med variation i HIV-virusmængden.
Imidlertid har undersøgelser af HLA-regionen været udfordrende i heterogene populationer.
Under SARS-CoV-2-pandemien er sundhedsvæsner blevet overbebyrdet, hvilket tydeliggør
behovet for tidlig og præcis patientrisikovurdering. I denne ph.d.-afhandling præsenteres
maskinlæring-løsninger til præcisionsmedicin for patienter med HIV og SARS-CoV-2 infektioner
drevet af modelforklarlighed.
I Studie I, undersøgte vi associationer mellem HLA-alleller og HIV-viral load (VL) i en genetisk
forskelligartet kohorte af 2546 HIV+-patienter fra Strategic Timing of AntiRetroviral Treatment
(START)-studiet ved at tilregne og redegøre for funktionelle relationer mellem HLA-alleler.
Maskinlæringsmetoder blev brugt til at imputere og gruppere HLA-alleller baseret på deres
forudsagte bindingsaffiniteter til HIV og uspecifikke peptider. Vi fandt fire store funktionelle
klynger, der repræsenterede 30 HLA-alleler, der tegner sig for over 11% af variabiliteten i VL,
der tidligere er rapporteret i homogene kohorter. Nogle af disse alleler, der var til stede i
forskellige populationer, delte en fælles funktion, der afspejlede lignende virkninger. Disse
virkninger blev også fundet ved hjælp af uspecifikke peptider, og derfor kunne den foreslåede
metode bruges til undersøgelse af HLA-alleller i andre sygdomme.
I Studie II, udviklede vi en maskinlæring model til at forudsige dødelighed indenfor 12 uger efter
en første positiv SARS-CoV-2-test baseret på 33938 tilfælde i løbet af det første år af COVID-
19-pandemien i Danmark. Ved at implementere en diskret-tidsmodelleringstilgang, der udnytter
elektroniske sundhedsjournaler til modeller med høj nøjagtighed, kunne vi forudsige og forklare
personlige overlevelseskurver og tidsmæssig dynamik. Af de endelige 22 karakteristika var
alder, køn, medicinforbrug, tidligere indlæggelser og lymfocyttal de største risikofaktorer for
dødelighed. Sammenlignet med tidligere modeller udviklet til COVID-19 tager vi højde for
censurerede patienter og manglende værdier, mens vi giver fortolkelige forudsigelser, der er
patientspecifikke. Den foreslåede metode er general og kan derfor anvendes til andre
prædiktive præcisionsmedicin-problemstillinger.
4. Introduction
Concepts such as precision medicine, artificial intelligence, machine learning or model
explainability are populating the medical literature. Understanding what these terms mean is
critical to establishing a basis of knowledge in which these concepts can be later actualised.
In this chapter, definitions of these important topics will be expanded and presented in the
context of patients with infectious diseases, in particular, those affected by the Human
Immunodeficiency Virus (HIV) and the Severe Acute Respiratory Syndrome Coronavirus 2
(SARS-CoV-2).
4.1. The new paradigm of precision medicine
In recent years, “precision medicine” has become a new focus in clinical research and
practice. This concept has gained favour over the term “personalised medicine” under the
premise that physicians have always aimed at providing individual and personalized
treatments informed by a personal relationship with their patients7. While the two terms are
used interchangeably, precision medicine emphasizes the need for incorporating a deeper
characterisation of patients to tailor disease prevention and treatment. Although a clear
definition is still debated, precision medicine can be understood as an iterative process of
accurate patient stratification up to the individual level through the development of clinically
relevant models that incorporate genomic, clinical, lifestyle and environmental information8.
Key to the development of precision medicine are the recent efforts in the digitalization of
healthcare systems and the reduced costs of deep phenotyping of patients through high-
throughput approaches7. The progressive implementation of electronic health records (EHR)
has allowed the widespread gathering of data from populations around the world influencing
clinical decision-making9. The availability of this vast amount of information not only has
increased the quality and quantity of data available for research but also enables the
possibility of disease surveillance through national databases. The availability of affordable
high-throughput techniques has facilitated the collection of detailed biological information at
the individual level. The most relevant case can be observed in the advent of array
technologies since the early 2000s10. These methods have allowed the study of genomic
variation to the level of single nucleotide polymorphisms (SNPs) powering a boom in genome-
wide association studies (GWAS). The growing interest in such technologies has catalysed
the launch of consumer-based platforms and national initiatives to genotype whole
populations. A clear example of this can be observed in the Nordic countries, where a
combination of robust health registries, universal healthcare and investment in collecting the
genotypes and genomes of their citizens provides fertile ground for precision medicine11.
Despite the progress made in terms of generating high-quality data, concerns arise on how
to turn the rich and complex clinical information gathered into models of disease prevention
and treatment adapted to the individual. This can be partially attributed to outdated
epidemiological and statistical methods that focus on the average person12. While these methods
were useful in reaching the current state of scientific knowledge, new methodologies are needed
to model the complexity present in medicine and healthcare13.
4.2. Inference and prediction in precision medicine
Since the scientific revolution in the 16th century, the method of observation, hypothesis
testing through experimentation and proposal of scientific models have been enriched by
mathematical applications. This evolution gave birth to modern statistics in the late 19th
century coupled with the ever-growing data collection and techniques for measuring relevant
metrics facilitated by the industrial revolution. The simplest expression of this field can be
seen in descriptive approaches to summarize data or represent data structures14
corresponding to the most general understanding of statistics. “Statistical learning” or
“statistical modelling” builds on and expands this basic understanding to capture relationships
between measured and future observations. It can be represented as a set of approaches
for estimating a function $f$ that links a set of observable variables $X$ to some response $Y$ in
the form:

$$Y = f(X) + \varepsilon$$

where $\varepsilon$ corresponds to a random error term, independent of $X$. Estimating $f$ is needed for
inference and prediction. When performing inference, the main goal is to understand the
relationships between the variables $X$ and the response $Y$ or, in other words, how $Y$ changes
as a function of $X$. Through inference we can answer questions about associations, relevant
variables or the relationship of each of them to the response of interest15. For prediction
tasks, given a new set of values for the input variables $X$, the goal is to accurately predict new
responses $Y$, given that the irreducible error term $\varepsilon$ is low enough15. In any case, how we
estimate $f$ depends on the task of interest. For inference, we need to know the exact form of
$f$ to define the relationships between $X$ and $Y$, whereas for prediction tasks, $f$ does not have
to be explicitly defined as long as it provides accurate predictions for $Y$. This type of reasoning
corresponds to cases in which responses or outcomes have been measured, in a supervised
learning setting. Alternatively, relationships between the variables can be explored without a
known response or outcome in an unsupervised manner.
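As an illustration of the distinction between inference and prediction, the following minimal sketch simulates data of the form Y = f(X) + ε and estimates f by ordinary least squares. The linear form of f, the coefficient values, the noise level and the sample sizes are all assumptions chosen for the example and are not taken from the studies in this thesis.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate Y = f(X) + eps with a known linear f (illustrative assumption).
n = 2000
X = rng.normal(size=(n, 2))
true_coef = np.array([2.0, -1.0])
noise_sd = 0.5  # standard deviation of the irreducible error term
y = X @ true_coef + rng.normal(scale=noise_sd, size=n)

# Hold out the last 500 observations to assess predictions.
X_train, X_test = X[:1500], X[1500:]
y_train, y_test = y[:1500], y[1500:]

# Estimate f by ordinary least squares (first column is the intercept).
A_train = np.column_stack([np.ones(len(X_train)), X_train])
beta, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)

# Inference: the estimated coefficients describe how Y changes with each X.
print("intercept and coefficients:", beta)

# Prediction: accuracy on unseen data, bounded below by the error variance.
A_test = np.column_stack([np.ones(len(X_test)), X_test])
test_mse = float(np.mean((A_test @ beta - y_test) ** 2))
print("held-out mean squared error:", test_mse)
```

In this toy setting the same fitted model serves both purposes: the coefficients are read for inference, while the held-out error, which cannot fall much below the variance of ε, summarizes predictive accuracy.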
Traditionally, inference has been prioritised in the scientific literature over prediction. This
has led to a predominance of data models, in which a known stochastic data model is proposed
to estimate $f$ and thereby define the relationship between variables and responses of interest.
Examples of this can be seen in linear regression, logistic regression or Cox models, where
goodness-of-fit tests and residual analyses are carried out to assess how well the model fits
the data16. This approach proved to be efficient for estimating the parameters of the proposed
models through direct mathematical calculation. Later on, with the development of computer
science, the algorithmic modelling culture gained relevance. With this approach, an
algorithm is applied to $X$ to predict $Y$; the form of $f$ is not defined in advance but is
learned from the data. In this case, the model parameters are estimated by an
optimization process called “training” and then assessed by their predictive accuracy
on data excluded during model generation16. Examples of such algorithms are
neural networks and decision trees.
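The contrast between the two modelling cultures can be sketched on simulated data. In the example below, a straight line stands in for a data model with a predefined form, while a k-nearest-neighbour regressor stands in for an algorithmic model whose form is learned from the data. The sine-shaped relationship, the noise level, the choice of k-nearest neighbours and the value k = 15 are assumptions made purely for illustration; both models are assessed on data excluded from fitting.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated non-linear relationship (illustrative assumption).
x = rng.uniform(-3, 3, size=1200)
y = np.sin(2 * x) + rng.normal(scale=0.1, size=x.size)
x_train, y_train = x[:1000], y[:1000]
x_test, y_test = x[1000:], y[1000:]


def knn_predict(x_new, x_ref, y_ref, k=15):
    """Predict by averaging the responses of the k nearest training points."""
    dist = np.abs(x_new[:, None] - x_ref[None, :])
    nearest = np.argsort(dist, axis=1)[:, :k]
    return y_ref[nearest].mean(axis=1)


# Data model: the form of f is fixed in advance as a straight line.
slope, intercept = np.polyfit(x_train, y_train, deg=1)
mse_linear = float(np.mean((slope * x_test + intercept - y_test) ** 2))

# Algorithmic model: no predefined form; predictions come from the data.
mse_knn = float(np.mean((knn_predict(x_test, x_train, y_train) - y_test) ** 2))

print("held-out MSE, linear data model:", mse_linear)
print("held-out MSE, k-nearest neighbours:", mse_knn)
```

Here the algorithmic model predicts better because the assumed linear form does not match the simulated relationship, but it offers no coefficients to interpret, which is the trade-off discussed in the text.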
In the context of precision medicine, these foundational notions of statistical learning have a
critical impact on the understanding of findings in the field and its objectives. The preference
for performing inference through data models requires very strong and restrictive assumptions to
define $f$ and describe the relationship between the input variables $X$ and the response $Y$.
Among these assumptions, the most common ones are that such relationships are additive
and linear15. Imposing such assumptions on the understanding of biological systems, while
useful as an approximation, has proved problematic in genetic studies, where
alternatives such as multiplicative models have been proposed17. Furthermore, the validity of
such data models and assumptions is conditional on their fit to the data, which is rarely
assessed or tested16. Erroneous conclusions can be drawn based on coefficients of such
models using a significance level of 5% without proper testing of assumptions and model
fit18. This is especially relevant when analysing high-throughput data where multiple testing
corrections are required19. Even if data models are correctly implemented, they are
commonly used to test associations on observational data implying a particular type of causal
reasoning different from the causal assumptions used in an interventional setting such as
randomised clinical trials. Due to these concerns, statisticians have argued that predictive
models can help overcome some of these limitations by relaxing assumptions and focusing
on reproducibility14. A popularisation of predictive modelling is being fuelled by new advances
in computer science, algorithms, big data approaches and accessibility of affordable
computing.
4.3. Artificial intelligence, Machine Learning and Deep Learning
Technological advancements powered by research in mathematics framed under new
philosophical considerations have fostered the idea of artefacts or machines that could
perform human tasks. Since the invention of the first mechanical computer in the early 19th
century and the later proposal, in 1937, of the modern computer20, access to computers and
their capabilities have improved, becoming widespread in our society. Although coined in the
1950s, the concept of Artificial Intelligence has become popular in the last decade as a goal to
materialize the idea of technology aiding humans through automation.
Artificial intelligence (AI) can be understood as the capabilities of machines to enact
knowledge encoded in formal language using logical inference rules21. Initially, this
knowledge was explicitly encoded by humans in a “knowledge-based” approach. An
example of this is the usage of “if-else” statements as a basic form of AI. Such tasks, while
formally simple to define, are more challenging for humans to perform than for a
computer. Despite being successful in many domains, including medicine22, these systems
encounter limitations in problems such as image and speech recognition, which are complex
to formalize into simple rules but intuitive for humans. To overcome this, Machine Learning
(ML) was proposed as a branch of AI to emphasize the ability of computational systems to
acquire knowledge. In this way, the focus shifted to developing algorithms that could learn
complex rules from data through error minimization. The main challenges in ML are how to
encode data into numerical vectors so such algorithms can learn meaningful representations
and how to define formal objectives to optimize according to the task at hand. The
importance of representation and abstractions learned by computational systems as a
successful approach for complex (but intuitive) tasks such as speech recognition and image
classification led to the development of Deep Learning23. As a sub-field of Machine Learning,
Deep Learning (DL) arose from the popularization of Artificial Neural Networks (ANNs),
algorithms inspired by mathematical models of biological neurons. After the proposal in the
mid-80s of stochastic gradient descent to train the parameters of such networks24, these
could accommodate more layers of neurons, becoming “deeper” in their structure.
Nowadays, AI is often conflated with ML or DL; hence, noting the hierarchical relationship
between these concepts is necessary.
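The difference between knowledge-based AI and Machine Learning can be sketched with a one-variable toy problem. In the example below, a hand-written “if-else” rule is compared with a threshold learned from labelled data by minimizing classification errors. The simulated measurement, the boundary of 38.0 used to generate the labels and the expert cut-off of 37.5 are hypothetical values chosen for illustration only.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical toy data: one numeric measurement and a binary label.
# The labels are generated with a boundary of 38.0, unknown to both approaches.
measurement = rng.uniform(35.0, 41.0, size=500)
label = (measurement > 38.0).astype(int)


def expert_rule(value):
    """Knowledge-based approach: a human encodes the rule explicitly."""
    return 1 if value > 37.5 else 0


def learn_threshold(values, labels):
    """Machine Learning approach: pick the threshold with the fewest errors."""
    candidates = np.sort(values)
    errors = [np.sum((values > c).astype(int) != labels) for c in candidates]
    return float(candidates[int(np.argmin(errors))])


learned = learn_threshold(measurement, label)
expert_errors = sum(expert_rule(v) != y for v, y in zip(measurement, label))
learned_errors = int(np.sum((measurement > learned).astype(int) != label))

print("learned threshold:", learned)
print("errors of the hand-written rule:", expert_errors)
print("errors of the learned rule:", learned_errors)
```

The learned rule has the same “if-else” form as the hand-written one; what changes is that its parameter is obtained from data through error minimization rather than encoded by a human.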
Prediction in the context of ML has been denominated “supervised learning”. In this setting,
a set of input variables named “features” are used for training models to predict a known
response or output. Depending on whether the output value is quantitative or qualitative,
ML can be used to perform regression or classification, respectively25. The input features
can be of any type as long as they can be encoded numerically into feature vectors.
Nevertheless, special considerations are taken when dealing with temporal data which can
be framed in the context of forecasting, time series or survival analysis26. If an output value
is not available or prediction is not required, ML can be used also to learn representations,
patterns and clusters from data in what is called unsupervised learning. This approach has
been argued to be similar to how humans and animals learn23. Nevertheless, while successful
at prediction and representation learning, inference through ML has been complicated due
to the complexity of the models generated. The lack of transparency arising from such complexity
has led to these models being described as “black boxes”. To overcome this, new concepts and
computational approaches have been proposed to open the black box.
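To make the unsupervised setting concrete, the sketch below groups unlabelled points into two clusters with a minimal k-means-style procedure. The simulated data, the restriction to two clusters, the deterministic initialization and the number of iterations are assumptions for illustration; this is not the consensus clustering approach listed in the methods of this thesis.

```python
import numpy as np

rng = np.random.default_rng(3)

# Two well-separated groups of points; no labels are given to the algorithm.
points = np.concatenate([
    rng.normal(0.0, 0.5, size=(100, 2)),
    rng.normal(5.0, 0.5, size=(100, 2)),
])


def two_means(data, n_iter=10):
    """Minimal k-means with k = 2 and a simple deterministic initialization."""
    # Start from the first point and the point farthest away from it.
    first = data[0]
    second = data[np.argmax(np.linalg.norm(data - first, axis=1))]
    centres = np.stack([first, second])
    for _ in range(n_iter):
        # Assign each point to its nearest centre.
        dist = np.linalg.norm(data[:, None, :] - centres[None, :, :], axis=2)
        assignment = dist.argmin(axis=1)
        # Move each centre to the mean of the points assigned to it.
        centres = np.stack([data[assignment == j].mean(axis=0) for j in range(2)])
    return assignment, centres


assignment, centres = two_means(points)
print("cluster sizes:", np.bincount(assignment))
print("cluster centres:", centres)
```

The procedure recovers the two simulated groups without ever seeing a response variable, which is what distinguishes it from the supervised examples of regression and classification.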
4.4. Transparency, interpretability and explainability of machine learning models
Understanding how predictions of machine learning models are generated is critical for
scientific inquiry27. The importance of such understanding is even codified into law as these
models infuse other areas of society28. However, the challenges of generating explanations
are not only applicable to ML but an important epistemological problem in other areas of
knowledge29. In the last decade, the field of explainable AI (xAI) or interpretable machine
learning (IML) has gained popularity but is still in the process of consolidation30,31. Due to
this, diverse ontologies have been proposed to distinguish between approaches to model
explainability. A useful framework for navigating xAI can be laid out by distinguishing between
transparency, interpretability and explainability of ML models27.
Transparency refers to the understanding of the mechanisms by which a model works30. This
understanding can be of the entire model to the point that it could be simulated by a human,
the ability to decompose a model in its parts or to describe the whole process of a model to
generate an output31. As pointed out in 4.2, this level of understanding is a key aspect of
statistical inference and examples of such approaches can be seen in linear regression,
logistic regression, decision trees or principal component analysis (PCA). More sophisticated
approaches for transparent models have been proposed in the context of ML by the use of
symbolic regression32. In the context of medicine, risk scores that can be calculated by
clinicians are examples of transparent models.
Models with higher predictive performance tend to be more opaque since their complexity can
potentially accommodate the complexity of the data. To illuminate this, model interpretability
aims to present properties of an ML model in terms of human understanding27, i.e., mapping
an abstract concept (model prediction) into a domain that a human can make sense of33.
Approaches to ML interpretability can be model-specific, such as feature importance
methods for tree-based models, or model-agnostic (post hoc). Model-agnostic
approaches have been proposed, for example, by creating proxy linear models locally34,
using Shapley values from game theory through SHAP values35 or mathematical
decomposition of neural networks by layer-wise relevance propagation (LRP)33.
Explainability of ML models is sometimes used interchangeably with interpretability, but some
authors have highlighted the differences29,33. Explainability emphasizes the consideration of
contextual information from domain knowledge related to the analysis goal to generate
explanations based on model interpretations which cannot be achieved only algorithmically27.
While model interpretability is purely descriptive of the model outputs in relation to their
inputs, model explainability accounts for causal notions by the user to explain not only how
but why the model could have provided a decision based on domain knowledge. Hence
explanations in the context of ML are subject to the same challenges as in other areas of
science which in this case make it dependent on the human-agent interaction29. Examples
of such challenges are limitations in the explanations due to current domain knowledge or
bias in selecting explanations among multiple possible ones. Nevertheless, generating
explanations based on accurate predictive models can potentiate scientific discoveries by
unravelling the complexity in data through Machine Learning.
In high-stakes fields such as healthcare, the need for explanations of algorithms for clinical
decision-making is debated. Pragmatist positions hold that a well-performing model
should be used, as is done in other areas of medicine where certain drugs or interventions
have been used without full knowledge of the biological or clinical mechanisms behind them36,37.
An argument for the need for xAI in healthcare is based on building trust among clinicians and
patients regarding the use of algorithms. We tend to rely more on human medical decisions
even in contexts where AI outperforms humans, but interventions have been proposed to
bridge the gap and encourage the use of algorithms38. Nevertheless, ML models can be
subject to biases39 or problems that require further inspection and scepticism about their
predictions for which not only model explainability can help but also good practices during
model development and implementation.
4.5. Artificial intelligence for patients with infectious diseases
AI applied to medicine has been increasingly improving healthcare over the years. Early
approaches from the 1970s used rule-based methods for diagnosis or signal
processing22. In recent years, the focus has been on Deep Learning applications for
processing images, following the promising results in image classification observed since 201222.
Examples of these implementations can be seen in radiology where algorithms are tested to
process X-ray images and applications for other types of medical scans such as computed
tomography or imaging in pathology, dermatology, ophthalmology, gastroenterology or
cardiology. Despite such models being the least transparent, their results can be visually
assessed against experts where performance between clinicians and algorithms can be
compared37.
Other areas of medicine in which ML can offer solutions are in the context of health systems.
The increasing amount of health records that are digitalised allows for training ML models to
perform diverse tasks and predict clinical outcomes such as readmissions, mortality,
diagnosis or patient monitoring22. Most of these models have been developed and assessed
on retrospective data but have not been evaluated in clinical settings. This is important to
notice since some methodological limitations can prevent the success of these technologies.
Some of these challenges are the lack of predictions at the individual level for precise risk
assessment37 or certain biases when ignoring censored patients40 in medical AI (mAI).
In the context of infectious diseases, diverse solutions have been proposed based on
different data sources. Examples of these solutions are the development of models to detect
tuberculosis in X-rays41, diagnosis of infectious diseases based on structured data from
medical records42, early prediction of sepsis utilizing Natural Language Processing (NLP) in
unstructured data from clinical notes43 or predicting the risk of infection in patients with
immune dysfunction44. Most of the recent applications of ML for patients with infectious
diseases have been in the context of the SARS-CoV-2 pandemic where the crisis opened
the opportunity for technological innovation45.
4.6. Predictive models during the SARS-CoV-2 pandemic
The severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) is a single-stranded
RNA virus that causes coronavirus disease 2019 (COVID-19). Since its outbreak COVID-19
has resulted in over six million reported deaths worldwide46. The SARS-CoV-2 binds to the
angiotensin-converting enzyme 2 (ACE2) which is highly expressed in vascular endothelial
cells of the lungs47 making it a selective respiratory virus. It has been reported that up to 40%
of cases at the time of testing were asymptomatic or presymptomatic48. The most common
symptoms are cough, fever, fatigue, headache, sore throat and loss of smell49. Based on
estimates from Europe in 202150, approximately 6% of infected individuals would progress
into severe disease requiring hospitalization. Furthermore, 14.2% of hospitalised patients
would be admitted to the Intensive Care Unit (ICU) or require supportive oxygen therapy.
The overall mortality rate among COVID-19 cases was 1.5%. These numbers contrast with
those from Denmark in 2020 where a 20% hospitalization rate, 2.8% admission rate to an
Intensive care unit (ICU) and a 5.2% fatality rate within 30 days were reported based on
patients that were positive for SARS-CoV-251. The differences in these rates capture the
evolving nature of the pandemic in which not only mutations in the virus as new variants were
observed but also the impact of public health interventions such as vaccination and new
treatments. While these changes seem to point to better management of the pandemic in
terms of mortality, the fast-paced nature of such changes has challenged predictive models.
Since the start of the SARS-CoV-2 pandemic, at least 180 ML approaches have been
proposed to tackle various clinical challenges52. These include solutions from a population
level such as forecasting dynamics of the disease, predicting the impact of interventions or
outbreak detection to an individual level by modelling diagnosis, triage, or disease prognosis.
These models employ diverse data sources, including images e.g., radiography53,
unstructured text from clinicians or epidemiological data. To develop such models a
considerable amount of data is needed. However, due to the urgency of the pandemic, faster
solutions were required. Leveraging existing structured data in the form of EHR not only
allowed quick access to vast amounts of relevant clinical features54 but also paved the way for
implementing ML models in healthcare systems. In particular, EHR can be used to
develop prognosis prediction models, and multiple models based on traditional statistical frameworks
or ML methods have been reported. Nevertheless, most of these models were at high risk of
bias or poorly reported55,56. While some reasons for these issues are due to the quality and
quantity of the available data or the pre-selection of variables based on limited knowledge of
a new disease, some problems can be attributed to methodological assumptions and the
implementation of ML algorithms. The main methodological shortcomings of such studies
are the lack of explainability of the ML models and failure to consider censored individuals40,57
without relying on strong assumptions such as proportional hazards when performing
survival analysis26. Solutions to these issues are presented in Study II and explained in the
following chapters of this thesis (6.5.3 and 6.8).
4.7. Precision medicine of patients with infectious diseases
The areas of major interest in precision medicine have been in the context of prenatal
diagnostics, genetic diseases and oncology58. National initiatives have even emphasized
some of these areas such as oncology as a short-term focus to materialize the benefits of
precision medicine59. Nevertheless, despite this focus and scientific breakthroughs in the
area for the past 20 years, the implementation of these advancements in the clinic has
proved insufficient60.
Often overlooked, one area where precision medicine is having a great impact is in the
context of infectious diseases. Due to the early development of microbiology among the
biological sciences, the knowledge of the molecular mechanisms of pathogens coupled with
the increasing access to high-throughput techniques in the clinic has allowed the
development of better diagnostic tests61. An example of this has been not only the rapid
diagnosis of individuals infected by the SARS-CoV-2 virus but also the specific variant
involved62. For other pathogens, such as bacteria, a rapid diagnosis and specific
characterisation can inform treatment and reduce the risk of antibiotic resistance63. Rapid
and accurate pathogen characterisation and diagnosis is the initial step in which precision
medicine has impacted care. Once a pathogen has entered an organism, assessing the
disease progression is also necessary. This is especially relevant in the case of sepsis, one
of the major causes of death from any infectious disease, where the combination of quick
diagnosis and assessment of biomarkers is critical64.
While precision medicine maximizes its efforts in treating individuals, the development in the
field is also applicable at a population level. One important area of research has been the
understanding of genomic factors from hosts and their interplay with pathogens. In this
sense, the use of genomic approaches has proved increasingly beneficial in the context of
epidemiology65 and the widespread popularisation of GWAS has also expanded our
knowledge of infectious diseases66. Such genetic data to study infections has not only been
collected in the context of research studies but also through the analysis of consumer-based
genomic tests67. A particular disease for which global efforts have been made to find treatments
and understand the mechanisms of interaction with the host is the Human Immunodeficiency
Virus type 1 (HIV).
4.8. Host genomics in HIV infection
The Human Immunodeficiency Virus (HIV) type 1 is a single-stranded RNA retrovirus first
isolated in 198368 that has so far claimed 36.3 million [27.2–47.8 million] lives and by the end
of 2020, 37.7 million [30.2–45.1 million] people living with HIV (PLWH) were reported
according to the World Health Organization (WHO)69. HIV is transmitted via the exchange of
body fluids that can occur during sex, pregnancy, childbirth, or blood contact with infected
materials such as needles. Once in the organism, HIV targets mainly CD4+ T lymphocytes,
a type of immune cell, by binding to CD4 molecules and a co-receptor (CCR5 or CXCR4)70.
Primary HIV infection is characterised by an initial spike in HIV virions in the blood, measured
in terms of viral load (HIV-VL). However, due to a high mutation rate of the virus, HIV
eventually overcomes the immune response after a period of stability. This results in an
increase in HIV-VL and depletion of CD4+ T cell lymphocytes70. Due to these dynamics, the
levels of CD4+ T cells and HIV viral load (VL) measured in blood have been used as markers
of disease progression. Over time, the reduced levels of CD4+ T cells can result in acquired
immunodeficiency syndrome (AIDS) which can ultimately lead to death by opportunistic
infections when untreated70. Antiretroviral therapy (ART) has proved effective for nearly two
decades to suppress viral replication. Although the available therapies cannot cure the
infection, early detection and sustained treatment71 can suppress the virus to
undetectable levels, which equates to being untransmittable72.
Most of the achievements to fight the disease were accomplished by intensive research in
the lab and the clinic combined with public health initiatives. Over the years, interest
grew in understanding the ability of individuals to control HIV, the basis of which
could be found in genetic markers. Multiple GWAS studies have been conducted to assess
diverse HIV-related outcomes with consistent findings of associations in the genes encoding
for the C-C chemokine receptor type 5 (CCR5) and the Human Leucocyte Antigen (HLA)73.
In particular, HLA alleles have been reported to explain up to 12% of the variability in HIV
VL74 by imposing pressure on viral genetic diversity which by itself can explain up to 20%-
46% of the variability in VL75. A common problem of GWAS is the lack of diversity in the
cohorts, which are predominantly of European ancestry76. This is relevant in the context of HIV
where the global pandemic has severely impacted Africa and confounding effects could bias
the results of HIV studies due to population stratification and regional differences. In recent
years new studies have confirmed some of the previous findings in more diverse populations
not only at the host level3 but also reported interactions with viral genomes5. Nevertheless,
due to the high variability of the HLA region, the most variable region in the human genome77,
large sample sizes are required to study its effects. Although some attempts have been
made to increase the diversity of the cohorts78, we proposed considering HLA
function to group similar alleles and address these issues by enriching the cohorts with useful information.
This approach aided by ML is presented in Study I.
5. Objectives
This thesis aims to present Machine Learning applications for the development of precision
medicine in patients with infectious diseases. This is outlined by proposing computational
solutions to two major challenges in precision medicine: how to infer relevant host genetic
factors in heterogeneous populations (Study I) and how to predict patient-specific risk while
accounting for censored individuals (Study II). For both challenges, the implemented models
are explained based on domain knowledge of biological systems and disease aetiology
supported by methods of model interpretability. This corresponds to the secondary aim of
developing not only predictive models but also deepening the understanding of HIV host
genomics and SARS-CoV-2 risk factors respectively. The specific objectives of each study
were:
Study I – Associations of functional HLA class I groups with HIV viral load in a heterogeneous
cohort.
To assess if functional clustering of the main host genetic factors involved in HIV control,
Human Leukocyte Antigen alleles, based on predicted binding affinities to HIV peptides
facilitates the study of HLA alleles in demographically heterogeneous cohorts.
Study II – Personalized survival probabilities for SARS-CoV-2 positive patients by explainable
machine learning
To implement survival machine learning models for predicting personalized 12-week
mortality of SARS-CoV-2 positive patients by leveraging electronic health records and
describing temporal dynamics of relevant risk factors through model explainability.
6. Methods
6.1. Data sources
6.1.1. The Strategic Timing of AntiRetroviral Treatment (START) cohort
Study I is based on an international cohort from the Strategic Timing of AntiRetroviral
Treatment (START) clinical trial run by the International Network for Strategic Initiatives in
Global HIV Trials (INSIGHT). This cohort consists of asymptomatic HIV+ and ART-naïve
individuals with two CD4+ cell counts >500/μL at least 14 days apart within 60 days of
enrolment in the trial71. The objective of the trial was to determine if early initiation of
antiretroviral therapy improved outcomes in asymptomatic HIV+ individuals while thoroughly
characterising such populations across the world. A total of 4685 patients were enrolled from
35 countries across 5 continents between 2009 and 2013. Genotypic and baseline data
were available for 2546 patients using Affymetrix Axiom SNP array3. Furthermore, viral HIV
RNA paired-end sequences were generated from 2079 patients.
6.1.2. Electronic Health Records in the context of the SARS-CoV-2 pandemic
In Study II, access to raw electronic health records from the Capital Region and Region
Zealand (eastern Denmark) was granted and supported by the Danish Ministry of Higher
Education and Science in the framework of the COVIMUN project. The EHR were provided
as data extracts from the electronic patient journal (EPJ) by EPIC systems. The final dataset
was composed of observations up to the 2nd of March 2021 with historical data available up
to at least 3 years before. 963,265 individuals over 18 years old were identified with a Real-
Time Polymerase Chain Reaction (RT-PCR) SARS-CoV-2 test taken at test sites connected
to the EHR system between the 17th of March 2020 and the 2nd of March 2021. The data
available contained demographics, hospitalizations, vital parameters, laboratory test results,
diagnoses and medicines (ordered and administered) used for routine care by clinicians.
Figure 1. Consort diagram of the SARS-CoV-2 cohort.
6.2. Ethical considerations
In Study I, access to the data in the START clinical trial (NCT00867048)71 was granted by
the International Network for Strategic Initiatives in Global HIV Trials (INSIGHT). Written
consent for the study and genetic analyses were obtained from the participants and
approved by participants’ site ethics review committees during the trial.
In Study II, approval to access EHR was provided by the Danish Regional Ethical Committee
in the Capital Region (H-20026502) and Data Protection Agency (P-2020-426) ensuring
compliance with the required ethical and legal regulations. Following the Danish law,
informed consent from patients can be waived given that approval from the Ethical
Committee is obtained before access to EHR for research purposes.
6.3. Encoding electronic health records for Machine Learning
To process biological and clinical information with an ML algorithm, data needs to be
encoded in meaningful feature vectors and clinical outcomes according to the prediction
task. In Study II, longitudinal data from raw EHR was used to engineer features and predict
the risk of death 12 weeks after a first positive SARS-CoV-2 test (FPT).
Multiple time windows and summary statistics can be used for feature engineering. We opted
for a simple approach to facilitate the interpretation of the final feature set. We encoded basic
characteristics such as age, sex, and body mass index (BMI) as the latest value observed
up to the day of FPT. For continuous values, such as vital parameters (e.g., systolic
blood pressure) and laboratory test results (e.g., blood cell counts), we considered the latest
value observed in the last month before the FPT. For categorical variables such as
diagnoses, in the form of International Statistical Classification of Diseases and Related
Health Problems version 10 codes (ICD-10), or medications, as Anatomical Therapeutic
Chemical (ATC) codes, we used the counts of such codes in the last three years and one
year respectively. Some extra features were added such as the number of weeks since the
start of the pandemic until the FPT was taken and an indicator if the patient was hospitalized
when the FPT was performed. Previous hospitalisations were included as a variable for
hospital stays longer than 24h encoded as cumulative days in hospital within the last three
years. Missingness was considered not at random hence imputation was not performed. For
diagnoses and medications, the lack of a code was encoded as zero and for continuous
variables such as laboratory values and vitals, missingness was codified as missing values.
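The encoding rules described above can be illustrated with a small sketch. It is not the pipeline used in Study II: the record layout, the helper names (`latest_within`, `count_codes`), the example codes and all values are hypothetical, and the one-month and one-year windows are approximated as 30 and 365 days.

```python
from datetime import date, timedelta
import math


def latest_within(measurements, fpt, days):
    """Latest value observed in the `days` up to (and including) the FPT date.

    Returns NaN when nothing was measured in the window, so missingness
    stays explicit instead of being imputed.
    """
    start = fpt - timedelta(days=days)
    in_window = [(d, v) for d, v in measurements if start <= d <= fpt]
    if not in_window:
        return math.nan
    return max(in_window, key=lambda dv: dv[0])[1]


def count_codes(events, fpt, days):
    """Count occurrences of each code in the `days` up to the FPT date.

    Codes that were never observed do not appear; callers read them as zero.
    """
    start = fpt - timedelta(days=days)
    counts = {}
    for d, code in events:
        if start <= d <= fpt:
            counts[code] = counts.get(code, 0) + 1
    return counts


# Hypothetical patient with a first positive test (FPT) on 2020-11-10.
fpt = date(2020, 11, 10)
systolic_bp = [(date(2020, 9, 1), 150.0), (date(2020, 10, 20), 135.0),
               (date(2020, 11, 5), 128.0)]
crp = [(date(2020, 6, 1), 12.0)]  # outside the one-month window
diagnoses = [(date(2018, 5, 2), "I10"), (date(2019, 7, 9), "I10"),
             (date(2016, 1, 1), "E11")]
medications = [(date(2020, 3, 3), "C09AA"), (date(2018, 3, 3), "C09AA")]

dx_counts = count_codes(diagnoses, fpt, days=3 * 365)  # last three years
rx_counts = count_codes(medications, fpt, days=365)    # last year

features = {
    "systolic_bp": latest_within(systolic_bp, fpt, days=30),
    "crp": latest_within(crp, fpt, days=30),
    "icd10_I10": dx_counts.get("I10", 0),
    "icd10_E11": dx_counts.get("E11", 0),
    "atc_C09AA": rx_counts.get("C09AA", 0),
}
```

In this toy example the laboratory value stays missing (NaN) because it falls outside the window, whereas the absent diagnosis code is encoded as zero, mirroring the distinction made in the text.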
6.4. Bioinformatics approaches for viral genomics
In Study I, in addition to host genomics, viral HIV RNA paired-end sequences were available
based on Illumina MiSeq sequencing covering two amplicons in the HXB2 genome
positioned 1485-5058 and 5967-9517. Preparation and quality control of this data has been
reported in a previous study5. While different methods exist for multiple alignments of viral
sequencing79, we opted for a simpler approach to generate putative epitopes from HIV
sequences. To do this, the raw reads were fragmented using KAT80 into k-mers 27 bases long, which were then
translated into peptides 9 amino acids long. This length corresponds to the average length
of HLA class I epitopes81. The resulting peptides were mapped to 10 major HIV proteins (Asp,
Gag-Pol, Nef, Vpr, Vpu, gp160, Vif, Pr55, Rev, Tat) using BLAST by considering exact
matches to the reference proteins extracted from NCBI with an E-value < 1E-05.
Alternatively, to generate a random peptidome, sequences 9 amino acids long were collected by
processing half a million protein sequences from UniProt.
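The fragmentation and translation step can be sketched in plain Python. This is only an illustration of the idea, not the KAT-based workflow used in Study I: the function names are hypothetical, only the first forward reading frame of each 27-mer is translated, and k-mers containing stop codons or ambiguous bases are discarded, all of which are assumptions.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def kmers(read, k=27):
    """Yield every overlapping k-mer of a nucleotide read."""
    for start in range(len(read) - k + 1):
        yield read[start:start + k]


def translate(kmer):
    """Translate a k-mer in its first forward frame.

    Returns None if a stop codon or an unknown base is encountered.
    """
    peptide = []
    for i in range(0, len(kmer) - 2, 3):
        aa = CODON_TABLE.get(kmer[i:i + 3])
        if aa is None or aa == "*":
            return None
        peptide.append(aa)
    return "".join(peptide)


def putative_peptides(reads, k=27):
    """Collect the distinct peptides encoded by all k-mers of all reads."""
    peptides = set()
    for read in reads:
        for kmer in kmers(read.upper(), k):
            peptide = translate(kmer)
            if peptide is not None:
                peptides.add(peptide)
    return peptides


# Toy read of 28 bases, which yields two overlapping 27-mers.
reads = ["ATG" + "GCT" * 8 + "A"]
peptides = putative_peptides(reads)
```

Each 27-base fragment encodes exactly nine codons, which is why this k-mer length produces 9-mer peptides.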
6.5. Supervised Machine Learning
6.5.1. Human Leucocyte Antigen imputation
In Study I, genotypic data was used to impute classic HLA alleles given the relevance of
these loci in GWAS studies3. While multiple methods for HLA imputation have been
proposed, Machine Learning approaches have proved to be efficient when imputing multi-
ethnic populations82. The method of choice, HIBAG83, is based on a bagging approach where
an ensemble of classifiers is trained on SNP genotypes for which the HLA haplotypes are
known by bootstrapping samples and averaging the posterior probabilities of the predicted
HLA haplotypes. Pre-trained models on SNPs present in Affymetrix UK Biobank Axiom Arrays
using multi-ancestry data from multiple GlaxoSmithKline clinical trials and HapMap phase 2
were available. The model predicted the posterior probabilities of classic HLA class I (HLA-
A, HLA-B, and HLA-C) and class II (HLA-DP, HLA-DQ, and HLA-DR) alleles at 4-digit
resolution given the genotypes. The posterior probabilities measure the uncertainty of the
resulting predictions. Only HLA alleles with probabilities ≥0.5 were called setting the rest to
missing values.
6.5.2. Binding affinity prediction of HLA class I alleles to HIV peptides
For assessing functional similarities between HLA class I alleles we predicted their binding
affinities to peptides in the context of Study I. To do so we used NetMHCpan 4.081, a method
based on neural networks trained on extensive sets of peptides where the binding affinities
to HLA alleles have been experimentally assessed. This model has been reported as one of
the best-performing methods in the latest benchmarks84 with an accuracy close to the
experimental assessment of binding affinities. We predicted binding affinities of all 268 HLA
class I alleles available in the model to the HIV and random peptides previously
described in 6.4. From the predicted immunopeptidomes, informative subsets were selected
based on (i) peptides from the top 10% binding affinities, (ii) peptides within the top 10% of
the variability in binding affinities across HLA class I alleles and (iii) binders to at least 10%
of the alleles defined by a binding affinity <500nM85.
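Two of the selection criteria can be sketched as follows, assuming the predicted affinities are held as a nested mapping from peptide to allele to affinity in nM. The function names, peptide identifiers and values are hypothetical, and measuring variability as the population variance on the nM scale is an assumption. Criterion (i) is left out because its exact definition cannot be inferred from the text alone.

```python
from statistics import pvariance


def binders(affinities, threshold_nm=500.0, min_fraction=0.10):
    """Peptides bound (affinity < threshold) by at least `min_fraction` of alleles."""
    selected = []
    for peptide, per_allele in affinities.items():
        n_bound = sum(1 for nm in per_allele.values() if nm < threshold_nm)
        if n_bound / len(per_allele) >= min_fraction:
            selected.append(peptide)
    return selected


def most_variable(affinities, top_fraction=0.10):
    """Peptides in the top `top_fraction` by variance of affinity across alleles."""
    ranked = sorted(
        affinities,
        key=lambda p: pvariance(list(affinities[p].values())),
        reverse=True,
    )
    n_keep = max(1, round(len(ranked) * top_fraction))
    return ranked[:n_keep]


# Hypothetical predicted affinities (nM) for three peptides and two alleles.
affinities = {
    "pep1": {"HLA-A*02:01": 40.0, "HLA-B*57:01": 20000.0},
    "pep2": {"HLA-A*02:01": 30000.0, "HLA-B*57:01": 32000.0},
    "pep3": {"HLA-A*02:01": 450.0, "HLA-B*57:01": 480.0},
}

bound_peptides = binders(affinities)
variable_peptides = most_variable(affinities)
```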
6.5.3. Survival analysis by discrete-time modelling
It is common in clinical research that individuals leave a study before its end or follow-up of
the outcome of interest is discontinued. This is known as right-censoring and contrasts with
left-censoring which occurs when an event of interest has happened before the start of a
study but the time of when the event happened is unknown. A plethora of methods has been
proposed for inference of the time to an event or, more common in epidemiology, survival
analysis. The Kaplan-Meier estimator and the Cox proportional hazards model are examples
of such methods that can account for censoring. In the context of prediction with ML models,
a common approach is to remove the censored observations and perform binary
classification which has been reported to generate biased models40,57. Different ML
approaches have been suggested to perform survival analysis26 but due to the usage of
strong assumptions and complex loss functions the interpretability of these models is
challenging based on existing tools. The most common assumption is that survival time is
continuous and the events of interest can happen at any time point. However, when
measured, time is always discrete even if conceptualised as continuous. In contrast,
modelling approaches considering time as discrete present numerous advantages, primarily
they can provide a closer representation of the data based on how it was observed.
In Study II, we implemented discrete-time modelling to predict 12-week mortality in SARS-
CoV-2 positive patients after observing right-censoring in patients tested after the 8th of
December 2020 (12 weeks before the data extraction) for whom follow-up was not
completed. This approach, originally described by Cox as an approximation to his
proportional hazards assumption for continuous-time modelling86, allowed us to train binary
classifiers at each time interval by discretizing time into predetermined intervals87. This was
done by augmenting the data longitudinally by repeating each feature vector for an individual
as many times as time intervals the individual was observed (Figure 2). A variable
representing the time intervals was added as an input feature. The target value contains
zeroes up to the row of the last time interval in which an individual was observed. In that last row,
the target is encoded as 1 if the patient died and as 0 if the individual was alive. When using the model for
new predictions, every individual was longitudinally augmented up to the maximum time of
prediction (12 weeks in our case) and accordingly indicated by the time feature.
Figure 2. Example of data transformations for discrete-time modelling.
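The longitudinal augmentation can be sketched as follows. This is a minimal illustration under stated assumptions rather than the implementation used in Study II: the function names, the feature dictionaries and the weekly interval width are hypothetical.

```python
def augment_for_training(features, last_interval, died):
    """Expand one individual into one row per observed time interval.

    The target is 0 in every row except the last one, which is 1 only if
    the individual died in that interval. Censored individuals therefore
    contribute rows whose targets are all 0.
    """
    rows = []
    for t in range(1, last_interval + 1):
        event = 1 if (died and t == last_interval) else 0
        rows.append({**features, "interval": t, "target": event})
    return rows


def augment_for_prediction(features, max_interval=12):
    """Expand a new individual up to the prediction horizon (no target)."""
    return [{**features, "interval": t} for t in range(1, max_interval + 1)]


# Hypothetical individuals.
patient_a = augment_for_training({"age": 81, "sex": 1}, last_interval=3, died=True)
patient_b = augment_for_training({"age": 45, "sex": 0}, last_interval=2, died=False)
new_patient = augment_for_prediction({"age": 60, "sex": 0})
```

Here `patient_a` died in the third interval, `patient_b` was censored after the second, and `new_patient` receives one row per interval so that a binary classifier can output a hazard for each of the 12 weeks.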
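The predicted probabilities of death at each time interval constitute the hazard function
h(t|x), which can also be expressed as a survival function S(t|x) and a cumulative density
function F(t|x) as defined below:
\[ h(t \mid x) = P(T = t \mid T \geq t, x) \tag{1} \]
\[ S(t \mid x) = \prod_{i=1}^{t} \left( 1 - h(i \mid x) \right) \tag{2} \]
\[ F(t \mid x) = 1 - S(t \mid x) \tag{3} \]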
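As a small numerical illustration, the standard discrete-time relations can be applied to a vector of predicted hazards: the survival probability at an interval is the running product of one minus the hazards up to that interval, and the cumulative probability of death is its complement. The function names and hazard values below are hypothetical.

```python
from itertools import accumulate
from operator import mul


def survival_from_hazard(hazards):
    """Survival probability at each interval: running product of (1 - hazard)."""
    return list(accumulate((1.0 - h for h in hazards), mul))


def cumulative_death_probability(hazards):
    """Cumulative probability of death at each interval: 1 - survival."""
    return [1.0 - s for s in survival_from_hazard(hazards)]


# Hypothetical predicted hazards for three consecutive intervals.
hazards = [0.10, 0.20, 0.50]
survival = survival_from_hazard(hazards)
cumulative = cumulative_death_probability(hazards)
```

For these values the survival probabilities are approximately 0.90, 0.72 and 0.36.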
6.6. Unsupervised Machine Learning
Although a target value is not used in unsupervised learning, metrics to measure similarity
between observations are needed. For Study I, the dissimilarity between HLA alleles,
represented as different sets of predicted binding affinities (see 6.5.2), was measured as
cosine, Pearson correlation and Euclidean distances between these feature vectors. The
resulting distance matrices were then processed using hierarchical clustering to generate
dendrograms of functionally related HLA alleles. To do so two distinct types of linkage
functions were used to agglomerate observations hierarchically. On the one hand, average
linkage iteratively joins the clusters with the shortest average distance between each other; on
the other hand, Ward linkage merges clusters by minimizing the error sum-of-squares
(ESS). While sometimes overlooked, Ward linkage requires distances in Euclidean space;
hence, the cosine and correlation matrices were corrected by taking their square root88.
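The three dissimilarity measures and the square-root correction can be sketched in plain Python. The function names and toy vectors are hypothetical, and the definitions of cosine and correlation distance as one minus the respective similarity are assumptions about the exact formulas used in Study I.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def correlation_distance(u, v):
    """1 - Pearson correlation, computed as cosine distance of centred vectors."""
    mean_u, mean_v = sum(u) / len(u), sum(v) / len(v)
    return cosine_distance([a - mean_u for a in u], [b - mean_v for b in v])


def euclidean_distance(u, v):
    return math.dist(u, v)


def distance_matrix(vectors, metric, sqrt_correct=False):
    """Symmetric pairwise distance matrix.

    With sqrt_correct=True each distance is replaced by its square root,
    the correction applied before Ward linkage for non-Euclidean measures.
    """
    n = len(vectors)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = metric(vectors[i], vectors[j])
            if sqrt_correct:
                d = math.sqrt(max(d, 0.0))  # guard against tiny negative rounding
            matrix[i][j] = matrix[j][i] = d
    return matrix


# Toy binding-affinity profiles for three hypothetical alleles.
profiles = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [3.0, 2.0, 1.0]]
cosine = distance_matrix(profiles, cosine_distance)
correlation_ward = distance_matrix(profiles, correlation_distance, sqrt_correct=True)
euclidean = distance_matrix(profiles, euclidean_distance)
```

The resulting matrices could then be passed to any hierarchical clustering routine that accepts precomputed distances.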
6.6.1. Consensus clustering
When using ML methods, different algorithms, parameters and hyperparameters can yield
diverse valid solutions. To overcome this, ensemble learning has become a popular
approach in supervised machine learning in the form of bagging (e.g., in 6.5.1), boosting or
stacking. However, similar principles can be used with unsupervised learning to avoid bias
and generate better biological representations89. In the context of Study I, we implemented
consensus clustering90 to generate robust clusters based on the different subsets of
immunopeptidomes, distance metrics and linkage functions. A consensus matrix (Cij) of size
(n x n) was generated where each element contained the number of times an ith allele
clustered together with a jth allele when an increasing number of clusters were selected (3
to 160). After transforming the values into dissimilarity scores (1 – Cij), the consensus matrix
was then processed by hierarchical clustering with average linkage.
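The construction of the consensus matrix can be sketched as follows. The function names and the toy partitions are hypothetical. Dividing the co-clustering counts by the number of partitions, so that 1 – Cij lies between 0 and 1, is an assumption about how the counts were scaled.

```python
def consensus_matrix(partitions):
    """Fraction of partitions in which items i and j fall in the same cluster.

    `partitions` is a list of label lists, one per clustering solution,
    all ordered by the same items.
    """
    n = len(partitions[0])
    counts = [[0] * n for _ in range(n)]
    for labels in partitions:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    total = len(partitions)
    return [[c / total for c in row] for row in counts]


def to_dissimilarity(consensus):
    """Turn co-clustering frequencies into dissimilarity scores (1 - Cij)."""
    return [[1.0 - c for c in row] for row in consensus]


# Three hypothetical clustering solutions for four alleles.
partitions = [
    [0, 0, 1, 1],
    [0, 0, 0, 1],
    [0, 1, 2, 2],
]
C = consensus_matrix(partitions)
D = to_dissimilarity(C)
```

In practice the list of partitions would hold one labelling for each combination of immunopeptidome subset, distance metric, linkage function and number of clusters.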
6.7. Training and assessment of Machine Learning models
Estimating the parameters and assessing the performance of predictive models such as ML
models for supervised learning differs from data models described in 4.2. A strict assessment
of the hyperparameter optimisation and predictions in data unseen by the model is required.
Moreover, avoiding data leakage is critical to prevent ML algorithms from learning
information that would not be accessible otherwise when generating new predictions. This
would restrict the generalisation capabilities of the model and potentially lead to overfitting.
In Study II, we used cross-validation in two steps to split the data into different sets with the
same mortality and time distributions. First, 60% of the data (training set) was divided into 5
folds and used for training the parameters of gradient boosting decision tree models
(LightGBM91). Because of the class imbalance of the dataset, deaths were assigned more
weight in the algorithm through a positive class weight of 100. With the trained parameters,
feature selection was performed by assessing 20% of the data (validation set). Second, the
training and validation sets were combined, split into 5 folds and used to train the final
ensemble of 5 models based on the parameters and features optimized in the previous step.
The predictions of the ensemble were combined using the mean of predicted probabilities
and a threshold of 0.5 was used to generate binary classes. The remaining 20% of the data
(test set) was used for the final performance assessment. Binary metrics such as sensitivity,
specificity, the precision-recall area under the curve (PR-AUC) and Matthews correlation
coefficient (MCC) were computed for each predicted week by excluding censored
individuals in the calculations. These last two metrics are recommended when class
imbalance is present where metrics such as accuracy or the area under the receiver
operating characteristic (AUROC) can be biased92. The survival metric used was the
weighted concordance index (C-index) based on the inverse probability of censoring weights
computed across all 12 weeks93. 95% confidence intervals for the performance metrics were
generated by bootstrap resampling of the generated predictions.
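The combination of ensemble predictions and the bootstrap confidence intervals can be sketched as follows. The function names, the toy probabilities and the labels are hypothetical. The percentile method for the interval and the use of "greater than or equal to" at the 0.5 threshold are assumptions, and the exclusion of censored individuals is taken to have been applied beforehand.

```python
import math
import random


def ensemble_mean(per_model_probs):
    """Average predicted probabilities across models, one value per sample."""
    return [sum(p) / len(p) for p in zip(*per_model_probs)]


def mcc(y_true, y_pred):
    """Matthews correlation coefficient from binary labels and predictions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def bootstrap_ci(y_true, y_pred, metric, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval obtained by resampling predictions."""
    rng = random.Random(seed)
    n = len(y_true)
    scores = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        scores.append(metric([y_true[i] for i in idx], [y_pred[i] for i in idx]))
    scores.sort()
    lower = scores[int((alpha / 2) * n_boot)]
    upper = scores[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Hypothetical probabilities from two models for six uncensored individuals.
per_model_probs = [
    [0.9, 0.2, 0.7, 0.1, 0.7, 0.3],
    [0.8, 0.4, 0.5, 0.2, 0.9, 0.1],
]
y_true = [1, 0, 0, 0, 1, 1]

mean_probs = ensemble_mean(per_model_probs)
y_pred = [1 if p >= 0.5 else 0 for p in mean_probs]
score = mcc(y_true, y_pred)
ci_lower, ci_upper = bootstrap_ci(y_true, y_pred, mcc)
```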
6.8. Explainability of Machine Learning models
As introduced in 4.2, inference has been traditionally used to arrive at biological and clinical
insights. Powerful predictive approaches through ML can also be leveraged for scientific
insights when coupled with strong domain knowledge and tools to interpret ML models. In
Study I, we show based on previous knowledge of HIV host genomics and biological
mechanisms of HLA alleles that ML approaches can be used to augment the existing data
and incorporate functional information. Although the methods used for HLA imputation and
prediction of their binding affinities to peptides are not transparent, since they are based on
bagging techniques and artificial neural networks, their interpretability would only be useful
to understand mechanisms at the SNP and amino acid levels. The accurate predictions of
such models combined with statistical analyses were sufficient to generate meaningful
explanations.
In Study II, methods for model interpretability were required to explain the resulting ML
models based on current clinical knowledge. From the different approaches proposed for
model interpretability94 we opted to use SHapley Additive exPlanations (SHAP). Based on
cooperative game theory, SHAP values provide local explanations with theoretical
guarantees for their local accuracy and consistency95. The explanations provided account
for the contribution of each feature to the predicted value in each individual prediction.
Calculating SHAP values is computationally expensive, since multiple
combinations of features and values must be evaluated. We employed TreeSHAP95, an adaptation of
SHAP for tree-based ML methods, to overcome some of these limitations by
computing SHAP values in polynomial time and accounting for dependency between
features. Apart from providing the contribution of each feature to the hazard function h(t|x)
described in 6.5.3, SHAP values were used for feature selection by calculating the
mean(|SHAP|) for all features and removing those with a value lower than a pre-specified
threshold (0.01 in our case).
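To make the cost of exact Shapley values and the mean(|SHAP|) filter concrete, the following sketch computes Shapley values by brute force for a toy model and then applies the 0.01 threshold. It is not TreeSHAP and not the implementation used in Study II: the toy model, the single baseline row used to represent "absent" features and all function names are assumptions made for illustration.

```python
from itertools import combinations
from math import factorial


def shapley_values(model, x, baseline):
    """Exact Shapley values of model(x) relative to model(baseline)."""
    n = len(x)

    def value(subset):
        # Features in `subset` take the patient's value, the rest the baseline.
        z = [x[j] if j in subset else baseline[j] for j in range(n)]
        return model(z)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (value(s | {i}) - value(s))
        phi.append(total)
    return phi


def select_features(shap_matrix, names, threshold=0.01):
    """Keep features whose mean absolute SHAP value reaches the threshold."""
    n_rows = len(shap_matrix)
    keep = []
    for j, name in enumerate(names):
        mean_abs = sum(abs(row[j]) for row in shap_matrix) / n_rows
        if mean_abs >= threshold:
            keep.append(name)
    return keep


def toy_model(z):
    # Hypothetical score: one additive term plus one interaction term.
    return 0.02 * z[0] + 0.5 * z[1] * z[2]


x = [70, 1, 1]
baseline = [50, 0, 0]
phi = shapley_values(toy_model, x, baseline)
# Local accuracy: the contributions add up to the change in prediction.
gap = toy_model(x) - toy_model(baseline)
```

The nested loops visit every subset of the remaining features, so the number of model evaluations grows exponentially with the number of features. This is the cost that tree-specific algorithms avoid.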
6.9. Statistical analyses
Descriptive statistics were employed to summarise the cohorts in both studies and to
generate features or aggregate predictions. While this thesis focused on ML methods,
statistical modelling was employed in Study I to test associations of HLA alleles and functional
HLA clusters with VL. Each node in the resulting dendrogram from consensus clustering
described in 6.6.1 was tested using linear regression, adjusting for sex, self-reported race,
and country, for any associations with log10-transformed HIV-1 viral load. Due to the number
of tests performed, multiple testing correction19 by the Benjamini-Hochberg procedure was
applied to reduce Type I errors. Associations were considered significant when an adjusted
p-value (q-value) < 0.05 was observed.
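The Benjamini-Hochberg adjustment can be written in a few lines. The sketch below is a generic standard-library implementation for illustration, not the R code used in Study I; the function name and the example p-values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that q-values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        q_values[i] = running_min
    return q_values


# Hypothetical p-values for four dendrogram nodes.
q = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
significant = [value < 0.05 for value in q]
```

Each p-value is scaled by the number of tests divided by its rank, and the running minimum ensures that a smaller p-value never receives a larger q-value than a bigger one.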
6.10. Software and visualization tools
Open-source tools have been prioritised for conducting the studies outlined in this thesis to
facilitate access and reproducibility of the results. Data wrangling and statistical analyses
were primarily performed using R96 supported by the tidyverse library97. Machine learning
methods have been implemented in Python supported by the pandas98 and numpy99 libraries
using gradient boosting decision trees implemented in LightGBM91. Model performance was
assessed using the libraries scikit-learn100 and scikit-survival101. Summary tables were
generated using tableone102.
Regarding visualization and dissemination of the results, open-access and interactive results
were prioritized. In Study I, dendrograms and association coefficients were depicted using
Interactive Tree of Life (iTOL)103. Tanglegrams to compare diverse clustering approaches
were generated by the dendextend R package104. The results are presented through a web
application using shiny105 accessible at bit.ly/HLA_dendogram. Likewise, in Study II, the
resulting ML model and code are available in a public repository at bit.ly/COVIMUN_DT.
7. Summary of results
The main results of Study I and II are outlined below. For a full description of the methods
and results for each study please find the full manuscripts attached in the Manuscripts
section.
7.1. Study I
Associations of functional HLA class I groups with HIV viral load in a
heterogeneous cohort
Due to the challenges of studying important genotypic variants in multi-ancestry populations,
we assessed if accounting for functional similarities in HLA class I alleles could facilitate
association testing of HLA with HIV-1 viral load (VL) in a heterogeneous cohort. To do so we
employed supervised and unsupervised Machine Learning techniques for HLA imputation,
binding affinity prediction to peptides and consensus clustering of HLA alleles.
HLA alleles were imputed from genotypes of 2546 HIV+ participants. The reported accuracy
of the imputation in out-of-bag samples was >90% for all HLA loci. In a subset of 1122
participants tested for HLA-B*57:01 only 2 false positives and 2 false negatives were
observed based on our imputation. For HLA class I the calling rates were 95.5% for HLA-A,
82.8% for HLA-B and 96.4% for HLA-C as described in detail in a previous study3.
We predicted the binding affinities of 268 HLA class I alleles to 173,792 peptides derived
from 2079 HIV samples and half-million random peptides. We applied consensus clustering
to the generated immunopeptidomes to group HLA alleles based on functional similarities.
The resulting HLA clusters were used to test associations with HIV-1 viral load by linear
regression and adjusted by sex, self-reported race, and country (Figure 3).
Figure 3. Graphical abstract of Study I.
We found four HLA clusters associated with HIV-VL composed of 30 HLA class I alleles of
which 11 were observed in participants from the START cohort. On one hand, two of these
nodes were associated with a lower VL: one cluster composed of HLA-B*57:01, B*58:01,
B*57:02, and B*57:03 (β -0.25, q-value 7.02E-06) and a second cluster composed of
HLA-C*08:04 and C*08:01 (β -0.29, q-value 0.042). On the other hand, two nodes were
associated with higher VL: one cluster composed of six HLA-B*44 alleles, B*44:05, B*44:08,
B*44:04, B*44:03, B*44:02, B*44:27 (β 0.15, q-value 0.003) and a cluster composed of 16
alleles: B*35:20, B*35:16, B*35:10, B*35:43, B*35:08, B*35:19, B*35:41, B*35:01,
B*35:17, B*35:05, B*44:06, B*56:03, B*53:01, B*15:08 and B*15:11 (β 0.13, q-value 0.048)
(Figure 4). Only two HLA class I alleles from the reported clusters (HLA-B*57:01 and
B*57:03) were associated with HIV-VL when tested independently.
Figure 4. Dendrogram of HLA class I alleles clustered based on binding affinities to HIV
peptides. Associations, defined by an adjusted p-value < 0.05, are represented as thick
branches for nodes and black triangles for leaves. White triangles indicate HLA alleles
imputed in our cohort. The effect of the respective associations is colour-coded from
protective effect (blue) to detrimental (red). Green bars on the outer ring reflect HLA allele
counts. An interactive version is available at bit.ly/HLA_dendogram
While HIV-specific immunopeptidomes were used for the reported associations, they provided
only a small increase in statistical power compared to HLA clusters based on random
immunopeptidomes. Overall, the HLA clusters explained 11.44% of the
variance in HIV VL after adjustment for sex, self-reported race, and country,
similar to estimates reported in homogeneous cohorts of European ancestry.
7.2. Study II
Personalized survival probabilities for SARS-CoV-2 positive patients by
explainable machine learning
In the context of the COVID-19 pandemic, we developed ML models to predict mortality
within 12 weeks of a first positive SARS-CoV-2 test (FPT). We considered 33,938 patients
who had at least one SARS-CoV-2 RT-PCR positive test performed between the 17th of March
2020 and the 2nd of March 2021. 1,803 (5.34%) deaths were observed of which 141
happened after 12 weeks from FPT.
We generated 2723 features from electronic health records encoding information about
demographics, diagnoses, medications, laboratory test results and vital parameters. After
feature selection, only 22 features were sufficient for the final model. We performed discrete-
time modelling using gradient boosting decision trees (LightGBM91) to account for censoring
of patients for whom an FPT was taken after the 8th of December 2020 and who therefore had an
incomplete 12-week follow-up. By modelling the time explicitly in an ML framework, we
could predict patient-specific survival probabilities that could be explained in terms of 22
clinical features and time since FPT (Figure 5).
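Discrete-time modelling relies on reshaping the data so that each patient contributes one row per week at risk, which is how censored patients can be kept in the training set. The sketch below illustrates this person-period expansion under stated assumptions: the field names ("weeks_observed", "died", "features") and the 1-based week index are hypothetical and do not reproduce the data structures of Study II.

```python
def to_person_period(patients, n_weeks=12):
    """Expand one record per patient into one row per patient-week at risk."""
    rows = []
    for p in patients:
        # `weeks_observed` is the week of death or of last follow-up (1-based).
        last_week = min(p["weeks_observed"], n_weeks)
        for week in range(1, last_week + 1):
            died_now = int(p["died"] and week == p["weeks_observed"])
            rows.append(
                {"id": p["id"], "week": week, **p["features"], "event": died_now}
            )
    return rows


# Hypothetical patients: a death in week 3, a patient censored after 5 weeks,
# and a death after the 12-week window (counted as alive within the window).
patients = [
    {"id": "a", "weeks_observed": 3, "died": True, "features": {"age": 80}},
    {"id": "b", "weeks_observed": 5, "died": False, "features": {"age": 55}},
    {"id": "c", "weeks_observed": 15, "died": True, "features": {"age": 67}},
]
rows = to_person_period(patients)
```

A binary classifier trained on such rows, with "week" included as a feature, estimates the probability of dying in a given week conditional on being alive at its start.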
Figure 5. Graphical abstract of Study II. Panel a. depicts the period and sources represented
in the dataset. Panel b. displays the feature engineering process of values up to a first positive
test. Panel c. illustrates the Machine Learning approach for survival analysis.
The performance of our model for predicting the risk of death for all 12 weeks measured in
20% of the data (test set) showed a weighted C-index with 95% confidence intervals (CI) of
0.946 (0.941-0.950). At week 12, PR-AUC and MCC with 95% CI were 0.686 (0.651-0.720)
and 0.580 (0.562-0.597) respectively. This performance was reflected in a high
discriminatory power when looking at discrete and cumulative probabilities of death for each
individual aggregated as the median (Figure 6 a, b) or as individual examples (Figure 6 c, d).
Figure 6. Predicted individual discrete and cumulative death probabilities.
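The relation between weekly predictions and cumulative probabilities can be sketched as follows, assuming that the classifier output for each week is read as a discrete hazard, that is, the probability of dying in that week given survival up to its start. The function names and example hazards are hypothetical, and the quantities plotted in Figure 6 may have been derived differently.

```python
def survival_curve(hazards):
    """S(k) = product over j <= k of (1 - h_j) for weekly hazards h_1..h_K."""
    surv, alive = [], 1.0
    for h in hazards:
        alive *= 1.0 - h
        surv.append(alive)
    return surv


def death_probabilities(hazards):
    """Per-week (unconditional) and cumulative probabilities of death."""
    surv = survival_curve(hazards)
    previous = [1.0] + surv[:-1]
    discrete = [h * s_prev for h, s_prev in zip(hazards, previous)]
    cumulative = [1.0 - s for s in surv]
    return discrete, cumulative


# Hypothetical hazards for the first three weeks of one patient.
discrete, cumulative = death_probabilities([0.1, 0.2, 0.5])
```

Because no functional form links the hazards of different weeks, each patient obtains an individual survival curve without assuming proportional hazards.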
When SHAP values were computed for feature selection and model interpretability, high
values of features such as age, BMI, sex (male), and clinical factors such as the number of
unique prescribed medications and diagnosis codes manifested as the top risk factors
(Figure 7a). By including time as a variable following the discrete-time modelling approach
we could explore the temporal dynamics of individual risk factors. Age, ordered loop
diuretics, and admission at the time of FPT had a higher impact on the risk of dying early,
while BMI, diagnosis of Alzheimer’s disease, and ordered B-vitamin were relevant for late risk
(Figure 7b). The proposed ML approach not only captured relevant risk factors that differed
over time but also across individuals (Figure 7c-d).
The model could capture non-linear contributions of the features to the risk of death at 12
weeks. Partial dependence plots (PDP) revealed that age contributes to the risk of death
in patients over 60 years of age (Figure 8a) and that patients with a BMI lower than
30 have a higher risk of mortality (Figure 8b). A higher risk of death was also seen for patients with a low lymphocyte count
(Figure 8d). Finally, we found that the number of ordered medicines was a better predictor
of death than the number of diagnoses: patients with fewer than five ordered
medications in the last year showed up to 10% lower risk of death, whereas patients with
more than 20 ordered medications showed up to 40% higher risk (Figure 8h).
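A partial dependence curve is obtained by fixing one feature at a grid value for every patient and averaging the resulting predictions. The sketch below shows the computation for a toy risk function; the function, its coefficients and the three example rows are hypothetical and are not estimates from Study II.

```python
def partial_dependence(model, rows, feature_index, grid):
    """Average prediction when one feature is set to each grid value."""
    curve = []
    for value in grid:
        predictions = []
        for row in rows:
            modified = list(row)
            modified[feature_index] = value
            predictions.append(model(modified))
        curve.append(sum(predictions) / len(predictions))
    return curve


def toy_risk(row):
    # Hypothetical non-linear risk: flat until age 60, higher for BMI below 30.
    age, bmi = row
    risk = 0.05
    if age > 60:
        risk += 0.01 * (age - 60)
    if bmi < 30:
        risk += 0.02
    return risk


rows = [[45, 24], [70, 32], [82, 21]]
age_curve = partial_dependence(toy_risk, rows, feature_index=0, grid=[50, 60, 70, 80])
```

The curve stays flat up to 60 years and rises afterwards, which is the kind of threshold effect that a linear term would not capture.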
Figure 7. Global and local explanations of predictive features of 12-week mortality in SARS-
CoV-2 positive patients. The top row depicts agglomerated SHAP values representing global
explanations (a) and weekly global explanations (b). The bottom row illustrates predicted
discrete probabilities of death and the contribution of the features for two patients as an
example of personalized survival probabilities with their corresponding risk factors (c-d)
Figure 8. Partial dependence plots of relevant features in SARS-CoV-2 12-week mortality
prediction by survival status.
8. Discussion
The two studies presented in this thesis aim to tackle important challenges in precision
medicine of patients with infectious diseases. The solutions based on Machine Learning
approaches and the findings revealed by explaining the resulting models raise important
methodological, biological and clinical considerations.
8.1. Imputation and functional clustering of HLA alleles by Machine Learning
models
The central role of HLA molecules in the adaptive immune response has been recognised
not only in vitro but also in vivo. The study of HLA alleles in GWAS has been challenging due
to the extreme polymorphism and strong linkage disequilibrium in the HLA loci77. Big sample
sizes are required and difficulties arise when studying populations of diverse ancestry, with
the added problem of the cost of characterising HLA alleles in big cohorts. To facilitate this,
in Study I we imputed HLA alleles for 2,546 HIV+ participants based on SNP array data. The
imputed HLA alleles have proved valuable not only for this study but for the other relevant
scientific contributions listed in this thesis exploring the individual associations of these alleles
with HIV VL3, with clinical outcomes4 or host-viral genomic interactions5. Likewise, multiple
computational approaches have been proposed to investigate HLA alleles in the context of
HIV-1 infection: genome-to-genome analyses106, exploring new epitopes107 or assessing the
interactions of HLA class I molecules with other ligands such as killer immunoglobulin-like
receptors (KIRs)108.
We proposed clustering HLA alleles based on shared function under the reasonable
assumption that the majority of HLA functionality is mediated by their capacity to bind and
present epitopes. This allowed us to account for the high genetic diversity of the cohort and
viral sequences by predicting the binding affinities of HLA class I alleles to HIV-specific and
unspecific immunopeptidomes. Viral diversity was accounted for using a fast k-mer approach
to process all HIV sequences and avoid generating a consensus sequence. This is important
for viruses with high genetic variability such as HIV that are shaped by the selective pressure
inside the host109. Predicted binding affinities to HIV and unspecific peptides were used for
clustering. While clustering based on a Bayesian framework has been presented in a
previous study110, the results were limited by the homogeneity of the data and assumptions
of the clustering parameters. We opted for consensus clustering using different subsets of
immunopeptidomes to avoid bias from the substantial number of non-binders predicted that
can drive the computed distances among HLA alleles towards zero. Using a filtering and
ensemble approach helped us to mitigate issues due to high dimensionality and has been
successful when applied to biological data89.
8.1.1. Associations of functional HLA class I clusters with HIV viral load
A key parameter and potential source of bias in unsupervised learning is the estimation of
the number of final clusters or a threshold to split different entities. The dissimilarity matrices
based on predicted immunopeptidomes from Study I were measured on a continuous scale
hence defining discrete clusters was a problem. To overcome this and test associations of
functional clusters with HIV VL we processed the dissimilarity matrices from consensus
clustering by hierarchical clustering allowing us to test associations of each node in the
dendrogram from the leaves (individual HLA class I alleles) up to the roots (HLA functional
nodes). This process combined support for nodes based on their association with VL when
the combination of alleles in all child nodes allowed it. Compared to traditional HLA
association studies challenged by the polymorphism of the locus and diverse ancestries, we
could, by aggregating functionally related alleles, increase the statistical signal and uncover
associations that could not be found at the individual allele level due to sample size111. Strict
multiple testing was applied to avoid false positives from the increasing number of statistical
tests.
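How aggregating alleles under a dendrogram node increases the number of carriers can be illustrated with a small sketch. It is an unadjusted simplification: the binary carrier coding, the toy cohort and the function names are assumptions, and the adjustment for sex, self-reported race and country used in Study I is omitted.

```python
def node_carrier(participant_alleles, node_alleles):
    """1 if the participant carries at least one allele of the node, else 0."""
    return int(any(a in node_alleles for a in participant_alleles))


def ols_slope(x, y):
    """Slope of a simple (unadjusted) least-squares regression of y on x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / sxx


# Hypothetical toy cohort: imputed HLA-B alleles and log10 viral load.
cohort = [
    (["B*57:01", "B*07:02"], 3.9),
    (["B*58:01", "B*44:02"], 4.1),
    (["B*07:02", "B*08:01"], 4.6),
    (["B*35:01", "B*08:01"], 4.8),
]
node = {"B*57:01", "B*58:01", "B*57:02", "B*57:03"}

carriers = [node_carrier(alleles, node) for alleles, _ in cohort]
log_vl = [vl for _, vl in cohort]
beta = ols_slope(carriers, log_vl)
```

In this toy cohort each allele of the node is carried by a single participant, whereas the node has two carriers, which is the mechanism by which grouping functionally related alleles increases the statistical signal.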
We found four functional HLA class I nodes associated with HIV-VL accounting for 30 HLA
alleles. Only two of these alleles could be detected when performing independent allele-
specific tests. Some have been previously reported when tested individually112, as in the case
of alleles HLA-B*57:01, B*58:01, B*57:02, and B*57:03. By reporting their common
functionality and combined protective effect we could analyze these alleles together despite
their different frequencies between populations. We also found a protective effect of a
functional node of two alleles, HLA-C*08:04 and C*08:01, confirming findings from other
studies on admixed populations113. While most of our results were in alignment with previous
reports, we found a node of six HLA-B*44 alleles associated with a higher VL that contradicts
a previous study in a Chinese cohort114 for which alleles present in the cluster showed a
protective effect. We hypothesize this could be due to a regional adaptation, but the lack of
participants from that region did not allow us to explore this further. Nevertheless, our analysis
could explain 11.44% of the variance in HIV VL in a genetically diverse cohort, similar
to estimates reported in European74 and multi-ancestry cohorts78. Estimating the contribution of
genetic factors to outcomes of interest while sometimes overlooked is critical to assess the
impact of such variables in precision medicine.
8.2. Explainable Machine Learning for survival models in precision medicine
In the last decade, algorithmic modelling has become more prevalent for predictive tasks
such as prognosis or diagnosis in biomedical research22,37. This has been more evident
during the COVID-19 pandemic where at least 180 ML solutions have been proposed52.
Nevertheless, most of the models in the literature show a high risk of bias due to their design
or poor reporting55,56.
In Study II, we developed a prognostic ML model to predict the risk of death within the first
12 weeks from the first positive SARS-CoV-2 PCR test based on electronic health records
of 33,938 patients from Eastern Denmark. We proposed solutions to reported concerns
regarding prognostic models with a focus on model explainability. First, to model longitudinal
data, we implemented a discrete-time approach86,87 that has been shown to achieve
performance as good or better than continuous-time models115,116 such as regularized Cox
models or Random Survival Forests. By doing so, we could model all the individuals in our
cohort, even those for which the outcome was not observed due to a lack of complete follow-
up (12 weeks) at the time when the data extract was generated. This allowed us to avoid
selection bias and underestimation of predicted risks reported in other prognostic models by
not removing censored observations40,57. Because no proportionality of hazards was
assumed, our model could predict personalized survival probabilities117 for each patient using
binary classifiers at each time interval, leveraging existing algorithms such as gradient
boosting decision trees91. Furthermore, our model could handle missing values without
imputation and account for the class imbalance due to the low proportion of events in the
model by positive class weighting. Certain areas such as uncertainty estimation of individual
predictions or assessment of calibration could be further improved in our proposed modelling
approach.
To gain insights from the complexity learned by the models118 we used methods for model
interpretability in the context of explainable artificial intelligence (xAI). Most of these methods
operate by removing or altering the values of the features in the model94 to open the “black
box”. Some of these approaches have been used in clinical models for different diseases44,119
including COVID-19120. In some cases, the insights from model explainability revealed biases
and helped improve clinical models39. In Study II, we used SHapley Additive exPlanations
(SHAP)95 values to explain the contribution of the features included in the model to the
predicted survival probability given the specific context of the patient. In addition, since we
included time as a feature, we explain the temporal dynamics of such contributions which
have not been studied previously. We argue that these local explanations are necessary for
the implementation of machine learning models in precision medicine since they reveal
patient-specific risk factors. When combined, global explanations can be derived29,121
generalizing important risk factors involved in the prognosis, in this case, of SARS-CoV-2
positive patients. It is important to mention that the features explained and selected for their
predictive power do not necessarily imply causal effects21,52 of such features in mortality.
While informative, different sets of features could be equally predictive due to equally
performing solutions found during the training of ML models122.
8.2.1. Risk factors in SARS-CoV-2+ patients through model explainability
Electronic health records are a valuable resource that can be used to create prognostic
models. Since EHR are optimized for routine care compared to studies in which data is
collected following a protocol, leveraging such information requires careful processing123 of
the outcome to model and encoding of the features to include. In Study II, we chose to model
the outcome of death within the first 12 weeks from the positive SARS-CoV-2 PCR test. All-
cause mortality was the most reliable outcome available in our dataset. While other outcomes
might be more clinically relevant such as hospitalization, oxygen use or admission to an ICU,
the reported cause of these events by COVID-19 was uncertain. Moreover, this reporting
evolved during the pandemic incurring data shifts that even implied changes in national
policy in Denmark124. To provide a more informative outcome we explored sustained
recovery as presented in another study6 which could be used for further predictive models.
A basic approach was used to encode 2723 features from EHR of demographics, laboratory
test results, hospitalizations, vital parameters, diagnoses and medicines. Feature encoding
presents as a combinatorial problem for which multiple solutions have been proposed to
generate meaningful representations125,126. We encoded the latest values or counts in
clinically relevant time windows prior to FPT depending on the data type facilitating the
interpretation of the model. Nevertheless, more optimal solutions could be available by, for
example, including feature trajectories.
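The encoding of latest values and counts within a time window before FPT can be sketched as follows. The window lengths, the inclusion of records dated on the day of FPT, the example medication codes and the function names are assumptions made for illustration and do not reproduce the feature definitions of Study II.

```python
from datetime import date, timedelta


def latest_value(records, fpt, window_days):
    """Most recent value within `window_days` before FPT, or None if absent."""
    start = fpt - timedelta(days=window_days)
    eligible = [(day, value) for day, value in records if start <= day <= fpt]
    if not eligible:
        return None  # left missing; no imputation is attempted
    return max(eligible, key=lambda record: record[0])[1]


def count_unique(records, fpt, window_days):
    """Number of distinct codes recorded within `window_days` before FPT."""
    start = fpt - timedelta(days=window_days)
    return len({code for day, code in records if start <= day <= fpt})


# Hypothetical patient with a first positive test on 1 November 2020.
fpt = date(2020, 11, 1)
lymphocytes = [
    (date(2020, 10, 1), 1.2),
    (date(2020, 10, 28), 0.8),
    (date(2020, 11, 5), 2.0),  # after FPT, must not leak into the features
]
medications = [
    (date(2019, 1, 1), "N06DA02"),  # outside the one-year window
    (date(2020, 3, 1), "C03CA01"),
    (date(2020, 9, 1), "C03CA01"),
    (date(2020, 10, 1), "A11EA"),
]
features = {
    "lymphocytes_latest_30d": latest_value(lymphocytes, fpt, 30),
    "n_unique_medications_1y": count_unique(medications, fpt, 365),
}
```

Restricting every feature to records dated up to FPT keeps the prediction point well defined, and returning None for absent measurements leaves the handling of missing values to the model.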
From explaining our model, we found older age127, sex (male)128 and obesity129 as important
risk factors for mortality in SARS-CoV-2+ patients in alignment with previous studies. Age
was especially relevant in individuals over 60 years old due to age-related factors such as an
increased prevalence of comorbidities that our Machine Learning approach could capture
by modelling interactions. We observed that the number of medications prescribed or
administered was more informative than diagnosis codes suggesting a better proxy for
comorbidities130. Lymphocytopenia was also identified as a relevant risk factor that may
represent an immune dysfunction caused not only by COVID-19131 but also by ongoing therapy or
malignancies.
We reported the temporal dynamics and interactions of relevant features. We observed a
higher risk of death in the first four weeks since FPT, probably representing the period during
which the infection was still active47. We could also distinguish between risk factors for early
vs late mortality. Factors such as being hospitalized at the time of FPT, the week since the
start of the pandemic in which the prediction was made, age and administration of loop
diuretics were important factors for early death. Other factors such as lower BMI, diagnosis
of Alzheimer’s disease, and ordered B-vitamin explained the risk of late death (>8 weeks)
probably indicating frail patients with a profound disease progression hence not
recommended for ICU or mechanical ventilation. Some of these features indicate latent
variables not encoded in the features but impacting the outcome such as an improvement
of care due to an evolved understanding of the disease during the pandemic132. To encode
this trend, we included the week since the start of the pandemic as a feature. Patients
infected early rather than later in the pandemic had a higher risk of dying. This reflects the
need to not only focus on the data available in EHR but also to encode meaningful variables
informed by domain knowledge or the environment that could be highly predictive.
8.3. Strengths and limitations
The main strengths of this thesis reside in the methodology proposed and the large cohorts
of HIV+ and SARS-CoV-2+ individuals used for its assessment. The proposed ML
approaches provide solutions to key aspects in the precision medicine of patients with
infectious diseases while maximizing the performance and explainability of the resulting
models.
The functional analysis of HLA alleles in Study I allowed us to study one of the most
polymorphic regions in the human genome with a critical role in infections77, especially HIV133.
We demonstrate that the imputation of HLA alleles through ML is a cost-efficient and
accurate approach to infer HLA types in big and geographically diverse cohorts such as in
the START cohort. Coupled with accurate binding affinity predictions of HLAs to peptides
and consensus clustering to assess functional HLA groups, the presented approach is
neither specific to HIV nor viral load. By providing open access to our results, we facilitate
the study of HLAs and outcomes in other host-pathogen interactions based on already
computed HLA class I functional clusters.
In terms of the methods, some of the limitations of Study I can be found in the assumption of
the functionality of HLA alleles and the approach to infer relevant HLA alleles and functional
clusters. While most of the function of HLA molecules comes from presenting epitopes
through the binding of relevant peptides, they can also interact with other ligands that affect
their function such as KIRs which we have not accounted for108. Regarding the inference of
relevant HLA alleles or groups, we used linear models which do not account for non-linear
effects. This model choice is nevertheless the standard approach for testing associations. In
terms of the genetic data available, we lacked participants from Asian countries while also
missing extra participants from more African regions71.
We propose a complete framework for explainable prognostic models based on electronic
health records and Machine learning. As presented in Study II, we could account for non-
linear effects, handle missing values, relax assumptions such as proportional hazards when
modelling right-censored individuals, learn interactions between variables and explain
temporal dynamics of risk factors while improving discriminative performance. To our
knowledge, this modelling approach accounts for many of the drawbacks of previous
models, especially in the context of the COVID-19 pandemic56. By leveraging EHR based on
a data-driven selection of 22 variables our publicly available model could be tested for
implementation enabling precision medicine in routine care.
Some aspects of our proposed modelling framework could be improved. Assessing the
calibration of the models and measuring the uncertainty of individual predictions was
attempted but not accomplished in the discrete-time framework. Arbitrary time intervals were
chosen that could be further optimized to perform the discretization of time. It is worth noting
that the proposed approach requires more effort in processing the data into the required
input for the discrete-time approach. The main limitations of Study II are related to the
available data. Despite representing real-world data (RWD) with big sample sizes, some
uncertainty in EHR only allowed us to trust certain outcomes and conditions for the model to
be reliable. The prediction point at the first positive SARS-CoV-2 test limited the usage of the
model at the following times. Also, the lack of confirmation of the cause of death in SARS-
CoV-2+ individuals only permitted us to model all-cause mortality instead of deaths caused by
COVID-19. Due to the fast evolution of the pandemic, data shifts were present134; for
example, our model did not account for vaccinated individuals, who now correspond to
most of the population in Denmark. These limitations nevertheless can be solved by
retraining the models and validating them in external datasets to assess the generalizability
of predictions in other healthcare systems.
9. Conclusion
The work conducted in this thesis demonstrates the implementation of Machine Learning
methods for precision medicine of patients with infectious diseases. We presented in two
studies how these novel methods can be used to study challenging genetic regions in diverse
populations and to generate precise prognostic models. When supported by model
explainability, ML models not only provided accurate predictions but also insights into
complex mechanisms involved in HIV genetics and disease progression in SARS-CoV-2
infections.
In Study I, we imputed HLA alleles and predicted their binding affinities to HIV and unspecific
peptides. This allowed us to group functionally related HLA alleles that would be differentially
represented across populations hence facilitating association testing of HLA alleles with HIV
viral load in cohorts of diverse ancestry.
In Study II, we employed electronic health records to develop data-driven models of 12-week
mortality in SARS-CoV-2 positive patients. We show how discrete-time modelling can be
leveraged by existing approaches for binary classification. Modelling the time explicitly as a
variable allowed us to unravel the temporal dynamics of personalized risk factors for each
patient. This allowed us to not only train survival models that account for censored patients
with great performance but also accommodate existing techniques for model explainability.
While the ML approaches presented in this thesis have been applied in the context of HIV
and SARS-CoV-2 infections, the study of HLA alleles through their function and the
development of explainable survival models can be applied to other diseases. The
methodology presented not only allows for a deeper understanding of diseases by explaining
the resulting models but also addresses two of the major tenets in precision medicine: the
estimation of patient-specific risks and refined characterisations of individuals by accounting
for relevant genetic factors in clinical models.
10. Future perspectives
Given the stage of development in which precision medicine is still immersed, multiple
considerations can be taken regarding its future. One of the main concerns is the challenge
to translate the findings from the laboratory to clinical practice7,60,135, a cycle colloquially
denominated “from the bench to the bedside”. A lot of the roadblocks to precision medicine
can be found in the lack of logistical and educational progress in the field. To overcome these
issues, multiple national initiatives are focusing on building the necessary infrastructures11,59
to support the richer data from the high-throughput characterisation of patients and data
lakes to enable data-driven clinical models as in the case of PERSIMUNE136. Furthermore,
new educational programs on precision medicine are available for clinical practitioners and
researchers137. Nevertheless, important aspects of clinical research regarding
epistemological assumptions and methodological considerations9 have to be improved to
grant robust clinical models that could sustain their transition into the clinic. These aspects
relate to how and what to model in precision medicine.
10.1. How to model: from predictive to causal models
The increasing use of algorithmic models in statistical modelling and Machine Learning is not
only improving the predictive power of clinical models but also their acceptance into clinical
practice when aided by methods for model explainability. Prospective assessment of such
models and benchmarking against standard of care and medical practitioners are still
required to understand their clinical impact37. Some studies have shown algorithmic
models reaching the accuracy of medical experts138. However, when assessed
prospectively, models trained on retrospective data can experience distribution shifts and
lose performance over time134. An explanation for this can be found in the way that statistical
modelling and Machine Learning have been designed so far. Modelling based on correlations
or associations has left the assessment of causality to experiments and clinical trials. In
recent years, causal inference has been gaining popularity as an approach to quantifying the causal
effects of variables of interest. In the context of genomics, Mendelian randomization has been
proposed to measure the causal effects of genetic variation on different outcomes139. But
causal principles can also be applied in other areas. Principles of causal understanding such
as counterfactuals can be used in conjunction with Machine Learning algorithms. This ties
closely with the field of model explainability since most of our explanations are based on
counterfactual thinking29. When these principles are applied, they can provide more robust
and accurate clinical models140 that could withstand the process of deployment in clinical
practice.
10.2. What to model: data-driven models and complexity
Traditional clinical models have been developed by considering a restricted set of variables
of interest, selected based on previous domain knowledge. These variables were usually
chosen from available data in routine care or studies exploring the explanatory potential of
new variables. If new clinical variables improved the existing models, these would then be
measured and incorporated into routine care. With the digitalization of medical records and
the accessibility of high-throughput techniques, the amount of available data is growing
exponentially. Big data processed by Machine Learning algorithms enables a new,
data-driven way of modelling in which all available information can be used to predict an
outcome of interest. The selection of variables can then be informed by the algorithm, which
selects the most informative ones that could later be translated into new insights. This
approach of considering all available data when developing models can also inspire new
ways not only to account for all the already collected information but also to encode
complex variables that reflect environmental or psychosocial features. The need for
modelling complexity in medicine13 is supported by studies quantifying the impact of
genetic and environmental factors on diverse traits, in which an overestimation of heritability
in complex diseases has been reported141. In some cases, environmental or unknown
non-genetic factors can account for up to 70% of variability, as in the case of HIV set-point viral
load133. Exploring non-genetic variables of clinical relevance such as environmental and
psychosocial factors could not only improve the accuracy of clinical models but also expose
new candidates for interventions. By better understanding individuals and their environment,
precision medicine could manifest the etymological roots of “health”, referring to
“whole”, providing a complete and inclusive picture of the complexity of medicine.
11. References
1. Zucco, A. G. et al. Associations of functional HLA class I groups with HIV viral load in a heterogeneous cohort. 2022.06.21.22276431 Preprint at https://doi.org/10.1101/2022.06.21.22276431 (2022).
2. Zucco, A. G. et al. Personalized survival probabilities for SARS-CoV-2 positive patients by explainable machine learning. 2021.10.28.21265598 Preprint at https://doi.org/10.1101/2021.10.28.21265598 (2021).
3. Ekenberg, C. et al. Association Between Single-Nucleotide Polymorphisms in HLA Alleles and Human Immunodeficiency Virus Type 1 Viral Load in Demographically Diverse, Antiretroviral Therapy–Naive Participants From the Strategic Timing of AntiRetroviral Treatment Trial. J Infect Dis 220, 1325–1334 (2019).
4. Ekenberg, C. et al. The association of human leukocyte antigen alleles with clinical disease progression in HIV-positive cohorts with varied treatment strategies. AIDS 35, 783–789 (2021).
5. Gabrielaite, M. et al. Human immunotypes impose selection on viral genotypes through viral epitope specificity. The Journal of Infectious Diseases (2021) doi:10.1093/infdis/jiab253.
6. Moestrup, K. S. et al. Readmissions, post-discharge mortality and sustained recovery among patients admitted to hospital with COVID-19.
7. McGrath, S. & Ghersi, D. Building towards precision medicine: empowering medical professionals for the next revolution. BMC Medical Genomics 9, 23 (2016).
8. König, I. R., Fuchs, O., Hansen, G., Mutius, E. von & Kopp, M. V. What is precision medicine? European Respiratory Journal 50, (2017).
9. Gombar, S., Callahan, A., Califf, R., Harrington, R. & Shah, N. H. It is time to learn from patients like mine. npj Digit. Med. 2, 1–3 (2019).
10. Hasin, Y., Seldin, M. & Lusis, A. Multi-omics approaches to disease. Genome Biology 18, 83 (2017).
11. Njølstad, P. R. et al. Roadmap for a precision-medicine initiative in the Nordic region. Nature Genetics 1 (2019) doi:10.1038/s41588-019-0391-1.
12. Time to reality check the promises of machine learning-powered precision medicine. The Lancet Digital Health 2, e677–e680 (2020).
13. Miles, A. Complexity in medicine and healthcare: people and systems, theory and practice. Journal of Evaluation in Clinical Practice 15, 409–410 (2009).
14. Shmueli, G. To Explain or to Predict? Statist. Sci. 25, 289–310 (2010).
15. James, G., Witten, D., Hastie, T. & Tibshirani, R. An introduction to statistical learning. vol. 112 (Springer, 2013).
16. Breiman, L. Statistical Modeling: The Two Cultures (with comments and a rejoinder by the author). Statist. Sci. 16, 199–231 (2001).
17. Clarke, G. M. et al. Basic statistical analysis in genetic case-control studies. Nature Protocols 6, 121–133 (2011).
18. Baker, M. Statisticians issue warning over misuse of P values. Nature News 531, 151 (2016).
19. Goeman, J. J. & Solari, A. Tutorial in biostatistics: multiple hypothesis testing in genomics. in (2012).
20. Turing, A. M. On Computable Numbers, with an Application to the Entscheidungsproblem. Proceedings of the London Mathematical Society s2-42, 230–265 (1937).
21. Goodfellow, I., Bengio, Y. & Courville, A. Deep Learning. (MIT Press, 2016).
22. Yu, K.-H., Beam, A. L. & Kohane, I. S. Artificial intelligence in healthcare. Nature Biomedical Engineering 2, 719 (2018).
23. LeCun, Y., Bengio, Y. & Hinton, G. Deep learning. Nature 521, 436–444 (2015).
24. Rumelhart, D. E., Hinton, G. E. & Williams, R. J. Learning representations by back-propagating errors. Nature 323, 533–536 (1986).
25. Hastie, T., Tibshirani, R. & Friedman, J. H. The elements of statistical learning: data mining, inference, and prediction. vol. 2 (Springer, 2009).
26. Wang, P., Li, Y. & Reddy, C. K. Machine Learning for Survival Analysis: A Survey. arXiv:1708.04649 [cs, stat] (2017).
27. Roscher, R., Bohn, B., Duarte, M. F. & Garcke, J. Explainable Machine Learning for Scientific Insights and Discoveries. IEEE Access 8, 42200–42216 (2020).
28. Kaminski, M. E. & Malgieri, G. Algorithmic impact assessments under the GDPR: producing multi-layered explanations. International Data Privacy Law 11, 125–144 (2021).
29. Miller, T. Explanation in artificial intelligence: Insights from the social sciences. Artificial Intelligence 267, 1–38 (2019).
30. Lipton, Z. C. The Mythos of Model Interpretability. arXiv:1606.03490 [cs, stat] (2016).
31. Belle, V. & Papantonis, I. Principles and Practice of Explainable Machine Learning. arXiv:2009.11698 [cs, stat] (2020).
32. Udrescu, S.-M. & Tegmark, M. AI Feynman: A physics-inspired method for symbolic regression. Science Advances 6, eaay2631 (2020).
33. Montavon, G., Samek, W. & Müller, K.-R. Methods for Interpreting and Understanding Deep Neural Networks. arXiv:1706.07979 [cs, stat] (2017).
34. Ribeiro, M. T., Singh, S. & Guestrin, C. ‘Why Should I Trust You?’: Explaining the Predictions of Any Classifier. arXiv:1602.04938 [cs, stat] (2016).
35. Lundberg, S. & Lee, S.-I. A Unified Approach to Interpreting Model Predictions. arXiv:1705.07874 [cs, stat] (2017).
36. Wang, F., Kaushal, R. & Khullar, D. Should Health Care Demand Interpretable Artificial Intelligence or Accept “Black Box” Medicine? Ann Intern Med 172, 59–60 (2020).
37. Topol, E. J. High-performance medicine: the convergence of human and artificial intelligence. Nature Medicine 25, 44 (2019).
38. Cadario, R., Longoni, C. & Morewedge, C. K. Understanding, explaining, and utilizing medical artificial intelligence. Nat Hum Behav 5, 1636–1642 (2021).
39. Caruana, R. et al. Intelligible Models for HealthCare: Predicting Pneumonia Risk and Hospital 30-day Readmission. in Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining 1721–1730 (ACM, 2015). doi:10.1145/2783258.2788613.
40. Li, Y., Sperrin, M., Ashcroft, D. M. & Staa, T. P. van. Consistency of variety of machine learning and statistical models in predicting clinical risks of individual patients: longitudinal cohort study using cardiovascular disease as exemplar. BMJ 371, (2020).
41. Lakhani, P. & Sundaram, B. Deep Learning at Chest Radiography: Automated Classification of Pulmonary Tuberculosis by Using Convolutional Neural Networks. Radiology 284, 574–582 (2017).
42. Wang, M., Wei, Z., Jia, M., Chen, L. & Ji, H. Deep learning model for multi-classification of infectious diseases from unstructured electronic medical records. BMC Medical Informatics and Decision Making 22, 41 (2022).
43. Goh, K. H. et al. Artificial intelligence in sepsis early prediction and diagnosis using unstructured data in healthcare. Nat Commun 12, 711 (2021).
44. Agius, R. et al. Machine learning can identify newly diagnosed patients with CLL at high risk of infection. Nature Communications 11, 1–17 (2020).
45. Ardito, L., Coccia, M. & Messeni Petruzzelli, A. Technological exaptation and crisis management: Evidence from COVID19 outbreaks. R&D Management (2021) doi:10.1111/radm.12455.
46. World Health Organization. Weekly epidemiological update on COVID-19 - 8 June 2022. CoV-weekly-sitrep8Jun21-eng.pdf.
47. Matheson, N. J. & Lehner, P. J. How does SARS-CoV-2 cause COVID-19? Science 369, 510–511 (2020).
48. Sah, P. et al. Asymptomatic SARS-CoV-2 infection: A systematic review and meta-analysis. Proceedings of the National Academy of Sciences 118, e2109229118 (2021).
49. Lechien, J. R. et al. Clinical and epidemiological characteristics of 1420 European patients with mild-to-moderate coronavirus disease 2019. Journal of Internal Medicine 288, 335–344 (2020).
50. Clinical characteristics of COVID-19. European Centre for Disease Prevention and Control https://www.ecdc.europa.eu/en/covid-19/latest-evidence/clinical.
51. Reilev, M. et al. Characteristics and predictors of hospitalization and death in the first 11 122 cases with a positive RT-PCR test for SARS-CoV-2 in Denmark: a nationwide cohort. International Journal of Epidemiology 49, 1468–1481 (2020).
52. Syrowatka, A. et al. Leveraging artificial intelligence for pandemic preparedness and response: a scoping review to identify key use cases. npj Digit. Med. 4, 1–14 (2021).
53. Roberts, M. et al. Common pitfalls and recommendations for using machine learning to detect and prognosticate for COVID-19 using chest radiographs and CT scans. Nat Mach Intell 3, 199–217 (2021).
54. Izcovich, A. et al. Prognostic factors for severity and mortality in patients infected with COVID-19: A systematic review. PLOS ONE 15, e0241955 (2020).
55. Wynants, L. et al. Prediction models for diagnosis and prognosis of covid-19 infection: systematic review and critical appraisal. BMJ 369, (2020).
56. Navarro, C. L. A. et al. Risk of bias in studies on prediction models developed using supervised machine learning techniques: systematic review. BMJ 375, n2281 (2021).
57. Vock, D. M. et al. Adapting machine learning techniques to censored time-to-event health record data: A general-purpose approach using inverse probability of censoring weighting. Journal of Biomedical Informatics 61, 119–131 (2016).
58. Milo Rasouly, H., Aggarwal, V., Bier, L., Goldstein, D. B. & Gharavi, A. G. Cases in Precision Medicine: Genetic Testing to Predict Future Risk for Disease in a Healthy Patient. Ann Intern Med 174, 540–547 (2021).
59. Collins, F. S. & Varmus, H. A New Initiative on Precision Medicine. New England Journal of Medicine 372, 793–795 (2015).
60. The Lancet. 20 years of precision medicine in oncology. The Lancet 397, 1781 (2021).
61. Caliendo, A. M. et al. Better Tests, Better Care: Improved Diagnostics for Infectious Diseases. Clin Infect Dis 57, S139–S170 (2013).
62. Chavda, V. P., Patel, A. B. & Vaghasiya, D. D. SARS-CoV-2 variants and vulnerability at the global level. Journal of Medical Virology 94, 2986–3005 (2022).
63. Moser, C. et al. Antibiotic therapy as personalized medicine – general considerations and complicating factors. APMIS 127, 361–371 (2019).
64. Rello, J. et al. Towards precision medicine in sepsis: a position paper from the European Society of Clinical Microbiology and Infectious Diseases. Clinical Microbiology and Infection 24, 1264–1272 (2018).
65. Ladner, J. T., Grubaugh, N. D., Pybus, O. G. & Andersen, K. G. Precision epidemiology for infectious disease control. Nat Med 25, 206–211 (2019).
66. Thorball, C. W., Fellay, J. & Borghesi, A. Immunological lessons from genome-wide association studies of infections. Current Opinion in Immunology 72, 87–93 (2021).
67. Tian, C. et al. Genome-wide association and HLA region fine-mapping studies identify susceptibility loci for multiple common infections. Nature Communications 8, 599 (2017).
68. Barré-Sinoussi, F. et al. Isolation of a T-Lymphotropic Retrovirus from a Patient at Risk for Acquired Immune Deficiency Syndrome (AIDS). Science 220, 868–871 (1983).
69. HIV/AIDS. https://www.who.int/health-topics/hiv-aids.
70. Deeks, S. G., Overbaugh, J., Phillips, A. & Buchbinder, S. HIV infection. Nat Rev Dis Primers 1, 1–22 (2015).
71. INSIGHT START Study Group. Initiation of Antiretroviral Therapy in Early Asymptomatic HIV Infection. New England Journal of Medicine 373, 795–807 (2015).
72. Rodger, A. J. et al. Risk of HIV transmission through condomless sex in serodifferent gay couples with the HIV-positive partner taking suppressive antiretroviral therapy (PARTNER): final results of a multicentre, prospective, observational study. The Lancet 393, 2428–2438 (2019).
73. Naranbhai, V. & Carrington, M. Host genetic variation and HIV disease: from mapping to mechanism. Immunogenetics 69, 489–498 (2017).
74. Bartha, I. et al. Estimating the Respective Contributions of Human and Viral Genetic Variation to HIV Control. PLOS Computational Biology 13, e1005339 (2017).
75. Fraser, C. et al. Virulence and Pathogenesis of HIV-1 Infection: An Evolutionary Perspective. Science 343, (2014).
76. Sirugo, G., Williams, S. M. & Tishkoff, S. A. The Missing Diversity in Human Genetic Studies. Cell 177, 26–31 (2019).
77. Dendrou, C. A., Petersen, J., Rossjohn, J. & Fugger, L. HLA variation and disease. Nat Rev Immunol 18, 325–339 (2018).
78. Luo, Y. et al. A high-resolution HLA reference panel capturing global population diversity enables multi-ancestry fine-mapping in HIV host response. Nat Genet 53, 1504–1516 (2021).
79. Moshiri, N. ViralMSA: massively scalable reference-guided multiple sequence alignment of viral genomes. Bioinformatics 37, 714–716 (2021).
80. Mapleson, D., Garcia Accinelli, G., Kettleborough, G., Wright, J. & Clavijo, B. J. KAT: a K-mer analysis toolkit to quality control NGS datasets and genome assemblies. Bioinformatics 33, 574–576 (2017).
81. Jurtz, V. et al. NetMHCpan-4.0: Improved Peptide–MHC Class I Interaction Predictions Integrating Eluted Ligand and Peptide Binding Affinity Data. The Journal of Immunology 199, 3360–3368 (2017).
82. Pappas, D. J. et al. Significant variation between SNP-based HLA imputations in diverse populations: the last mile is the hardest. The Pharmacogenomics Journal 18, 367–376 (2018).
83. Zheng, X. et al. HIBAG—HLA genotype imputation with attribute bagging. The Pharmacogenomics Journal 14, 192–200 (2014).
84. Reynisson, B., Alvarez, B., Paul, S., Peters, B. & Nielsen, M. NetMHCpan-4.1 and NetMHCIIpan-4.0: improved predictions of MHC antigen presentation by concurrent motif deconvolution and integration of MS MHC eluted ligand data. Nucleic Acids Research 48, W449–W454 (2020).
85. Thomsen, M., Lundegaard, C., Buus, S., Lund, O. & Nielsen, M. MHCcluster, a method for functional clustering of MHC molecules. Immunogenetics 65, 655–665 (2013).
86. Cox, D. R. Regression Models and Life-Tables. Journal of the Royal Statistical Society. Series B (Methodological) 34, 187–220 (1972).
87. Tutz, G. & Schmid, M. Modeling Discrete Time-to-Event Data. (Springer International Publishing, 2016). doi:10.1007/978-3-319-28158-2.
88. van Dongen, S. & Enright, A. J. Metric distances derived from cosine similarity and Pearson and Spearman correlations. arXiv:1208.3145 [cs, stat] (2012).
89. Ronan, T., Qi, Z. & Naegle, K. M. Avoiding common pitfalls when clustering biological data. Sci. Signal. 9, re6–re6 (2016).
90. Strehl, A. & Ghosh, J. Cluster Ensembles — a Knowledge Reuse Framework for Combining Multiple Partitions. J. Mach. Learn. Res. 3, 583–617 (2003).
91. Ke, G. et al. LightGBM: A Highly Efficient Gradient Boosting Decision Tree. Advances in Neural Information Processing Systems 30, (2017).
92. Chicco, D. & Jurman, G. The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation. BMC Genomics 21, 6 (2020).
93. Uno, H., Cai, T., Pencina, M. J., D’Agostino, R. B. & Wei, L. J. On the C-statistics for evaluating overall adequacy of risk prediction procedures with censored survival data. Stat Med 30, 1105–1117 (2011).
94. Covert, I., Lundberg, S. & Lee, S.-I. Explaining by Removing: A Unified Framework for Model Explanation. Journal of Machine Learning Research 22, 1–90 (2021).
95. Lundberg, S. M. et al. From local explanations to global understanding with explainable AI for trees. Nat Mach Intell 2, 56–67 (2020).
96. R Core Team. R: A Language and Environment for Statistical Computing. (R Foundation for Statistical Computing, 2019).
97. Wickham, H. et al. Welcome to the tidyverse. Journal of Open Source Software 4, 1686 (2019).
98. The pandas development team. pandas-dev/pandas: Pandas 1.3.3. (2021) doi:10.5281/zenodo.5501881.
99. Harris, C. R. et al. Array programming with NumPy. Nature 585, 357–362 (2020).
100. Pedregosa, F. et al. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research 12, 2825–2830 (2011).
101. Pölsterl, S. scikit-survival: A Library for Time-to-Event Analysis Built on Top of scikit-learn. Journal of Machine Learning Research 21, 1–6 (2020).
102. Pollard, T. J., Johnson, A. E. W., Raffa, J. D. & Mark, R. G. tableone: An open source Python package for producing summary statistics for research papers. JAMIA Open 1, 26–31 (2018).
103. Letunic, I. & Bork, P. Interactive Tree Of Life (iTOL) v4: recent updates and new developments. Nucleic Acids Res 47, W256–W259 (2019).
104. Galili, T. dendextend: an R package for visualizing, adjusting and comparing trees of hierarchical clustering. Bioinformatics 31, 3718–3720 (2015).
105. Chang, W. et al. shiny: Web Application Framework for R. (2021).
106. Bartha, I. et al. A genome-to-genome analysis of associations between human genetic variation, HIV-1 sequence diversity, and viral control. eLife Sciences 2, e01123 (2013).
107. Arora, J. et al. HIV peptidome-wide association study reveals patient-specific epitope repertoires associated with HIV control. PNAS 116, 944–949 (2019).
108. Debebe, B. J. et al. Identifying the immune interactions underlying HLA class I disease associations. eLife 9, e54558 (2020).
109. Fellay, J. & Pedergnana, V. Exploring the interactions between the human and viral genomes. Hum Genet (2019) doi:10.1007/s00439-019-02089-3.
110. Dorp, C. van & Kesmir, C. Estimating HLA disease associations using similarity trees. bioRxiv 408302 (2018) doi:10.1101/408302.
111. Kennedy, A. E., Ozbek, U. & Dorak, M. T. What has GWAS done for HLA and disease associations? International Journal of Immunogenetics 44, 195–211 (2017).
112. Goulder, P. J. R. & Walker, B. D. HIV and HLA Class I: An Evolving Relationship. Immunity 37, 426–440 (2012).
113. Valenzuela-Ponce, H. et al. Novel HLA class I associations with HIV-1 control in a unique genetically admixed population. Scientific Reports 8, 6111 (2018).
114. Zhang, X. et al. HLA-B*44 Is Associated with a Lower Viral Set Point and Slow CD4 Decline in a Cohort of Chinese Homosexual Men Acutely Infected with HIV-1. Clin Vaccine Immunol 20, 1048–1054 (2013).
115. Kvamme, H. & Borgan, Ø. Continuous and Discrete-Time Survival Prediction with Neural Networks. arXiv:1910.06724 [cs, stat] (2019).
116. Sloma, M., Syed, F., Nemati, M. & Xu, K. S. Empirical Comparison of Continuous and Discrete-time Representations for Survival Prediction. in Proceedings of AAAI Spring Symposium on Survival Prediction - Algorithms, Challenges, and Applications 2021 118–131 (PMLR, 2021).
117. Haider, H., Hoehn, B., Davis, S. & Greiner, R. Effective Ways to Build and Evaluate Individual Survival Distributions. Journal of Machine Learning Research 21, 1–63 (2020).
118. Roscher, R., Bohn, B., Duarte, M. F. & Garcke, J. Explainable Machine Learning for Scientific Insights and Discoveries. arXiv:1905.08883 [cs, stat] (2019).
119. Lauritsen, S. M. et al. Explainable artificial intelligence model to predict acute critical illness from electronic health records. Nature Communications 11, 3852 (2020).
120. Yan, L. et al. An interpretable mortality prediction model for COVID-19 patients. Nature Machine Intelligence 1–6 (2020) doi:10.1038/s42256-020-0180-7.
121. Murdoch, W. J., Singh, C., Kumbier, K., Abbasi-Asl, R. & Yu, B. Interpretable machine learning: definitions, methods, and applications. arXiv:1901.04592 [cs, stat] (2019).
122. Molnar, C. et al. General Pitfalls of Model-Agnostic Interpretation Methods for Machine Learning Models. arXiv:2007.04131 [cs, stat] (2021).
123. Jensen, P. B., Jensen, L. J. & Brunak, S. Mining electronic health records: towards better research applications and clinical care. Nature Reviews Genetics 13, 395–405 (2012).
124. Statens Serum Institut. Typical misinformation regarding Danish COVID-numbers. https://en.ssi.dk/covid-19/typical-misinformation-regarding-danish-covid-numbers.
125. Landi, I. et al. Deep representation learning of electronic health records to unlock patient stratification at scale. npj Digit. Med. 3, 1–11 (2020).
126. Li, Y. et al. BEHRT: Transformer for Electronic Health Records. Sci Rep 10, 7155 (2020).
127. Gao, Y. dong et al. Risk factors for severe and critically ill COVID-19 patients: A review. Allergy: European Journal of Allergy and Clinical Immunology 76, 428–455 (2021).
128. Zhang, J. et al. Risk factors for disease severity, unimprovement, and mortality in COVID-19 patients in Wuhan, China. Clin Microbiol Infect 26, 767–772 (2020).
129. Gao, F. et al. Obesity Is a Risk Factor for Greater COVID-19 Severity. Diabetes Care (2020) doi:10.2337/dc20-0682.
130. Guan, W. et al. Comorbidity and its impact on 1590 patients with COVID-19 in China: a nationwide analysis. European Respiratory Journal 55, (2020).
131. Cippà, P. E. et al. A data-driven approach to identify risk profiles and protective drugs in COVID-19. PNAS 118, (2021).
132. Benfield, T. et al. Improved Survival Among Hospitalized Patients With Coronavirus Disease 2019 (COVID-19) Treated With Remdesivir and Dexamethasone. A Nationwide Population-Based Cohort Study. Clinical Infectious Diseases (2021) doi:10.1093/cid/ciab536.
133. Tough, R. H. & McLaren, P. J. Interaction of the Host and Viral Genome and Their Influence on HIV Disease. Front. Genet. 9, (2019).
134. Zhang, A., Xing, L., Zou, J. & Wu, J. C. Shifting machine learning for healthcare from development to deployment and from models to data. Nat. Biomed. Eng 1–16 (2022) doi:10.1038/s41551-022-00898-y.
135. Burgner, D., Jamieson, S. E. & Blackwell, J. M. Genetic susceptibility to infectious diseases: big is beautiful, but will bigger be even better? Lancet Infect Dis 6, 653–663 (2006).
136. Centre of excellence for personalised medicine of infectious complications in immune deficiency. https://www.persimune.dk/.
137. Videreuddannelse, S. Master i personlig medicin. https://personligmedicin.ku.dk/ (2020).
138. Gunčar, G. et al. An application of machine learning to haematological diagnosis. Scientific Reports 8, 411 (2018).
139. Sanderson, E. et al. Mendelian randomization. Nat Rev Methods Primers 2, 1–21 (2022).
140. Richens, J. G., Lee, C. M. & Johri, S. Improving the accuracy of medical diagnosis with causal machine learning. Nature Communications 11, 3923 (2020).
141. Muñoz, M. et al. Evaluating the contribution of genetics and familial shared environment to common disease using the UK Biobank. Nat Genet 48, 980–983 (2016).
12. Manuscripts
Manuscript of Study I
Associations of functional HLA class I groups with HIV viral load in a
heterogeneous cohort.1
Adrian G. Zucco, Marc Bennedbæk, Christina Ekenberg, Migle Gabrielaite, Preston Leung,
Mark N. Polizzotto, Virginia Kan, Daniel D. Murray, Jens D. Lundgren and Cameron R.
MacPherson for the INSIGHT START study group.
medRxiv, June 2022.
(Submitted to AIDS)
Associations of functional HLA class I groups with HIV viral load in a
heterogeneous cohort.
Adrian G. ZUCCO1, Marc BENNEDBÆK2, Christina EKENBERG1, Migle GABRIELAITE3, Preston LEUNG1, Mark N.
POLIZZOTTO4, Virginia KAN5, Daniel D. MURRAY1, Jens D. LUNDGREN1 and Cameron R. MACPHERSON1 for
the INSIGHT START study group.
1PERSIMUNE Center of Excellence, Rigshospitalet, Copenhagen, Denmark
2Virus Research and Development Laboratory, Virus and Microbiological Special Diagnostics, Statens Serum
Institut, Copenhagen, Denmark
3Center for Genomic Medicine, Copenhagen University Hospital, Copenhagen, Denmark.
4Clinical Hub for Interventional Research, College of Health and Medicine, The Australian National University,
Canberra, Australia
5George Washington University, Veterans Affairs Medical Center, Washington D.C, U.S.A.
Corresponding author
Adrian Gabriel Zucco, MSc, PhD student
Tel: +45 35 45 57 75
mail: adrian.gabriel.zucco@regionh.dk
Rigshospitalet, Copenhagen University Hospital
Centre of Excellence for Health, Immunity and Infections (CHIP) & PERSIMUNE
Blegdamsvej 9, DK-2100 Copenhagen Ø, Denmark
Abstract
Human Leucocyte Antigen (HLA) class I alleles are the main host genetic factors involved in controlling HIV-1
viral load (VL). Nevertheless, HLA diversity has proven a significant challenge in association studies. We
assessed how accounting for binding affinities of HLA class I alleles to HIV-1 peptides facilitates association
testing of HLA with HIV-1 VL in a heterogeneous cohort from the Strategic Timing of AntiRetroviral Treatment
(START) study. We imputed HLA class I alleles from host genetic data (2,546 HIV+ participants) and sampled
immunopeptidomes from 2,079 host-paired viral genomes (targeted amplicon sequencing). We predicted
HLA class I binding affinities to HIV-1 and unspecific peptides, grouping alleles into functional clusters through
consensus clustering. These functional HLA class I clusters were used to test associations with HIV VL. We
identified four clades totalling 30 HLA alleles accounting for 11.4% variability in VL. We highlight HLA-B*57:01
and B*57:03 as functionally similar yet overrepresented in distinct ethnic groups, showing, when
combined, a protective association with HIV VL (log, β -0.25; adj. p-value < 0.05). We further demonstrate
only a slight power reduction when using unspecific immunopeptidomes, facilitating the use of the inferred
functional HLA groups in other studies. The outlined computational approach provides a robust and efficient
way to incorporate HLA function and peptide diversity, aiding clinical association studies in heterogeneous
cohorts. To facilitate access to the proposed methods and results we provide an interactive application for
exploring data.
Introduction
The Human Leucocyte Antigen (HLA) is a critical component of the host immune response. HLA Class I alleles
mediate the anti-viral response through the presentation of intracellular viral peptides for recognition by
Cytotoxic T cells. This mechanism is critical to the host’s defence against diverse pathogens, which is why the
HLA region is among the most genetically diverse regions in the human genome, as evidenced by its association with our
variable response to infectious disease [1]. In the context of HIV, HLA alleles are the canonical host genetic
factors associated with viral load (VL) [2], altogether contributing up to 12% of its variability [3]. Viral diversity,
on the other hand, is thought to explain two to four times as much (20%-46%) [4]. The combined relevance
of both host and viral genomics suggests the necessity for the simultaneous analysis of both in the context
of infectious disease [5]. Different approaches have been used to study this interaction, such as
genome-to-genome [6] or peptidome-wide associations [7]. In addition, when considering the interplay between host
HLA alleles and viral peptides, their association could be explained functionally in terms of epitope binding
and presentation.
Analysis of genetic variance within the HLA region and its associations with clinical outcomes is challenged
by population-dependent distributions and the diversity of HLA haplotypes. Reaching adequate statistical power within
this region is difficult, and although accounting for the effects of this important immunological region is
necessary, it is often ignored altogether. Yet, the function of HLA is conserved as a core component of the
immune response [8], [9]. This suggests there may be shared functional features within and between
populations, even in the presence of genotypic plasticity. Understanding the structure of this mutual
information could prove relevant for the study of host-pathogen interactions where high mutation rates are
observed such as in HIV [10].
Immunopeptidomes have been used to estimate functional similarities between HLA alleles in what were
initially termed HLA supertypes [11]. These functional groups are based on the propensity of various
alleles to bind similar sets of peptides. The algorithms used to estimate these binding profiles have gradually
improved over time [12], [13]. However, the consideration of HLA functional groups in the context of
Genome-Wide Association Studies (GWAS) has largely been overlooked [14].
In this study, we considered the ability of HLA proteins to functionally bind peptides, leveraging the
cost-effectiveness of high-throughput genotyping as opposed to directly assaying HLA function. We provide a
framework based on state-of-the-art computational methods for the study of predicted immunopeptidomes
and functional HLA groups in the context of HIV-1 infection. In doing so, we demonstrate the increased
statistical power that is gained by moving away from a purely genetic approach to a functional one with
implications for studying the immune response either directly or as a confounder. We processed HIV
immunopeptidomes that incorporate intra-host viral diversity and imputed HLA alleles from the same
geographically diverse cohort of people living with HIV (PLWH) and antiretroviral therapy (ART) naïve
participants from the Strategic Timing of Anti-Retroviral Treatment (START) trial [16]. Interactive results of
this work and complementary data are available via a web application
(https://persimune-health-informatics.shinyapps.io/PAW2022Zucco__HLA_HIV_INSIGHT/).
Methods
Ethics
Host and viral samples in this study were extracted and analyzed from participants in the START clinical trial
(NCT00867048) [16], conducted by the International Network for Strategic Initiatives in Global HIV Trials
(INSIGHT) and the Community Programs for Clinical Research on AIDS (CPCRA). Written consent for the study
and genetic analyses was obtained from the participants, and the study was approved by the participants' site
ethics review committees.
HLA class I alleles
Imputation of HLA class I alleles (HLA-A, HLA-B and HLA-C) for 2546 genotyped ART-naïve, HIV+ participants
was performed with HIBAG at 4-digit resolution [17]. Full details of the imputation process and quality control
are described in previous publications [2], [18]. A multi-ethnic pre-trained model was used for imputation
with a minimum out-of-bag accuracy of over 90% for all loci. HLA diversity was measured by the inverse
Simpson index based on the HLA allele frequencies per locus and country. This index represents the
complement of the probability that two participants would have the same HLA allele for a selected locus in
a country.
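As a minimal sketch of the diversity measure described above, the snippet below computes Simpson-type indices
from a list of allele calls for one locus and country. The text names the inverse Simpson index (1 / Σp²) but
describes it as the complement of the probability of drawing the same allele twice (1 − Σp², often called the
Gini-Simpson index), so both quantities are returned here. The function name, the allele calls, and the choice
of sampling with replacement are illustrative assumptions, not the implementation used in the study.

```python
from collections import Counter


def simpson_indices(alleles):
    """Return (gini_simpson, inverse_simpson) for a list of allele calls.

    Assumption: alleles are drawn with replacement, so the probability of
    drawing the same allele twice is the sum of squared frequencies.
    """
    counts = Counter(alleles)
    total = sum(counts.values())
    same = sum((n / total) ** 2 for n in counts.values())
    return 1.0 - same, 1.0 / same


# Hypothetical HLA-B calls for one country (illustrative values only).
calls = ["B*07:02", "B*07:02", "B*57:01", "B*35:01"]
gini_simpson, inverse_simpson = simpson_indices(calls)
```

Both indices increase as allele frequencies become more even, so either one ranks countries by HLA diversity in
the same order.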
Immunopeptidomes and binding affinity prediction
Plasma samples were obtained for a subset of 2079 ART-naïve, HIV+ participants from 21 countries enrolled
in the START study. Viral RNA was sequenced, paired-end, using Illumina MiSeq and covered two amplicons
at positions 1485-5058 and 5967-9517 of the HXB2 genome. The sample preparation, library preparation,
sequencing procedure, and detailed quality controls have been described previously [19]. Raw reads were
fragmented into 27-mers using KAT [20], and those with a count higher than 1 were translated into peptide
sequences 9 amino acids in length to match the mean length of HLA class I epitopes. Peptides were mapped to
10 major HIV proteins (Asp, Gag-Pol, Nef, Vpr, Vpu, gp160, Vif, Pr55, Rev, Tat) from NCBI RefSeq NC_001802.1
(Supplementary file 1) using BLAST 2.8.1 (blastp-short). Hits with an E-value > 1E-05 were excluded to remove
low-quality k-mers by considering exact matches. To compare HIV peptidomes with random sequences, a set
of half a million 9-mers was generated by processing the same number of random protein sequences from
UniProt. Binding affinities for 268 class I HLA alleles to both HIV and random peptidomes were predicted using
NetMHCpan 4.0 [15]. Three different immunopeptidome subsets were generated by selecting (i) peptides with the
top 10% binding affinities, (ii) the 10% most variable peptides, and (iii) peptides that potentially bind to at
least 10% of the alleles, using < 500 nM as a general threshold [12].
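The three subset definitions can be sketched as follows, assuming a matrix of predicted affinities in nM with
peptides as rows and HLA alleles as columns (lower values mean stronger binding). How peptides are ranked for
subsets (i) and (ii) is not specified in the text; here the strongest affinity per peptide and the variance
across alleles are used, and both choices, like the function name and the toy matrix, are assumptions made only
for illustration.

```python
import numpy as np


def immunopeptidome_subsets(affinity_nm, frac=0.10, threshold_nm=500.0):
    """Return row indices for the three peptide subsets.

    affinity_nm: array of shape (n_peptides, n_alleles), predicted nM values.
    """
    n_peptides = affinity_nm.shape[0]
    k = max(1, int(round(frac * n_peptides)))
    # (i) Assumption: rank peptides by their strongest (lowest nM) affinity.
    strongest = affinity_nm.min(axis=1)
    top_binders = np.sort(np.argsort(strongest)[:k])
    # (ii) Assumption: variability is the variance of affinities across alleles.
    variability = affinity_nm.var(axis=1)
    most_variable = np.sort(np.argsort(variability)[::-1][:k])
    # (iii) Peptides predicted to bind (< threshold) at least `frac` of alleles.
    bound_fraction = (affinity_nm < threshold_nm).mean(axis=1)
    broad_binders = np.flatnonzero(bound_fraction >= frac)
    return top_binders, most_variable, broad_binders


# Toy matrix: 10 peptides x 10 alleles, all weak binders by default.
aff = np.full((10, 10), 50000.0)
aff[3, 0], aff[3, 1] = 1.0, 100.0   # peptide 3 binds two alleles strongly
aff[7, :5] = 600.0                  # peptide 7 varies widely but stays above 500 nM
top, variable, broad = immunopeptidome_subsets(aff)
```

In this toy case peptide 3 is selected by criteria (i) and (iii), while peptide 7 is selected only by
criterion (ii), which shows that the three subsets capture different aspects of the binding profiles.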
Consensus clustering
Hierarchical clustering was implemented using two different linkage functions. With average linkage,
dissimilarity between HLA alleles was defined by cosine, correlation, and Euclidean distances. For
clustering based on Ward linkage, cosine and correlation distances were square-root transformed
to satisfy the triangle inequality necessary to operate in Euclidean space [21]. We employed consensus
clustering over an ensemble of clustering solutions to mitigate bias from the choice of immunopeptidome
subset, distance metric, or linkage function [22]. A consensus matrix (Cij) of size (n x n) was built where
each element is the number of times the ith allele clustered together with the jth allele [23] across a varying
total number of clusters (3 to 160). The consensus matrix was then processed by hierarchical clustering
with average linkage after transforming the values into dissimilarity scores (1 – Cij). Clustering was performed
in Python 3.7.1 using the SciPy library [24].
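The consensus procedure can be sketched with SciPy as below. This is an illustration under stated assumptions
rather than the code used in the study: co-clustering counts are normalized by the number of runs so that
1 – Cij lies between 0 and 1, a single binding-profile matrix stands in for the three immunopeptidome subsets,
and the toy data and the range of cluster numbers are hypothetical.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist, squareform


def consensus_matrix(profiles, cluster_counts,
                     metrics=("cosine", "correlation", "euclidean")):
    """Fraction of runs in which each pair of rows (alleles) co-clusters."""
    n = profiles.shape[0]
    consensus = np.zeros((n, n))
    runs = 0
    for metric in metrics:
        dist = pdist(profiles, metric=metric)
        # Ward linkage: square-root transform for cosine and correlation.
        ward_dist = dist if metric == "euclidean" else np.sqrt(np.clip(dist, 0, None))
        for method, d in (("average", dist), ("ward", ward_dist)):
            tree = linkage(d, method=method)
            for k in cluster_counts:
                labels = fcluster(tree, t=k, criterion="maxclust")
                consensus += labels[:, None] == labels[None, :]
                runs += 1
    return consensus / runs


# Toy binding profiles: two groups of three alleles with opposite profiles.
rng = np.random.default_rng(0)
base = rng.normal(size=30)
profiles = np.vstack([base + 0.05 * rng.normal(size=30) for _ in range(3)]
                     + [-base + 0.05 * rng.normal(size=30) for _ in range(3)])
consensus = consensus_matrix(profiles, cluster_counts=range(2, 5))
# Final tree: average linkage on the dissimilarity 1 - Cij.
final_tree = linkage(squareform(1.0 - consensus, checks=False), method="average")
groups = fcluster(final_tree, t=2, criterion="maxclust")
```

Because every run contributes equally to the consensus matrix, no single distance metric or linkage function
dominates the final tree.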
Statistical analyses
Associations of log10-transformed VL with each node of the consensus tree were tested by linear regression
adjusted for sex, self-reported race, and country. Tested HLA alleles had to be present in more than 10%
of the participants. Multiple testing was controlled with the Benjamini-Hochberg procedure, using a q-value <
0.05 to identify associations. Analyses were performed in R v3.6.0 [25].
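For readers unfamiliar with the Benjamini-Hochberg procedure, the following sketch shows how adjusted p-values
(q-values) are obtained from a list of raw p-values. The analyses in this study were run in R; this Python
version is only an illustration, and the p-values below are hypothetical rather than results from the study.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest to the smallest p-value so that the adjusted
    # values never decrease as the raw p-values increase.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        q_values[i] = running_min
    return q_values


# Hypothetical p-values for five tested nodes of the consensus tree.
p = [0.001, 0.008, 0.039, 0.041, 0.27]
q = benjamini_hochberg(p)
significant = [qi < 0.05 for qi in q]
```

In this example the third and fourth p-values fall below 0.05 before adjustment but not after it, which is the
kind of false-positive control the threshold on q-values is meant to provide.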
Data visualization and availability
Consensus clustering dendrograms and association coefficients were depicted using Interactive Tree of Life
(iTOL) [26] and tanglegrams, which were generated by the dendextend R package [27]. We provide a flexible
visualization of peptide-to-HLA binding profiles across viral proteins. Data downloads and access to
supplementary materials are made available through the app
(https://persimune-health-informatics.shinyapps.io/PAW2022Zucco__HLA_HIV_INSIGHT/).
Results
Baseline characteristics for the START cohort
We used baseline data and the genotypes of 2,546 participants from the START trial. Next-generation
sequencing data from HIV samples were retrieved for a subset of 2,079 participants. All participants included in the
trial were asymptomatic HIV+ and ART-naïve with two CD4+ cell counts >500/μL at least 14 days apart within
60 days of enrollment in the trial. Baseline characteristics for study participants can be found in Table 1.
Prediction of patient-derived HIV immunopeptidomes
Our approach for immunopeptidome generation initially yielded 9.88 × 10^7 peptide 9-mers. After filtering
and mapping with BLAST to ten major HIV proteins, a total of 173,792 9-mers were considered. This
corresponded to 99.22% coverage of the reference proteome. Among the final list, we observed 136 best-defined
CTL/CD8+ epitopes from the Los Alamos HIV database (version 2019-11-20; Supplementary file 2).
Data exploration
To facilitate the exploration of the results, a web application was developed (Figure 1) that provides tools to
interactively navigate the global and local diversity of imputed HLA alleles, HIV subtype-derived peptides,
and their corresponding binding profiles. Shannon and Simpson diversity indices can be visualized directly on
the world map, highlighting the geographical diversity of the cohort. The tool also supports further
examination of specific HLA allele, HIV subtype, or peptide frequencies for the entire cohort, with options to
explore details at a country level.