Explainable Machine Learning for Precision Medicine of Patients with Infectious Diseases

Abstract

This thesis presents Machine Learning applications for the development of precision medicine in patients with infectious diseases. It does so by proposing computational solutions to two major challenges in precision medicine: how to infer relevant host genetic factors in heterogeneous populations (Study I) and how to predict patient-specific risk while accounting for censored individuals (Study II). For both challenges, the implemented models are explained based on domain knowledge of biological systems and disease aetiology, supported by methods of model interpretability. This corresponds to the secondary aim of not only developing predictive models but also deepening the understanding of HIV host genomics and SARS-CoV-2 risk factors, respectively. The specific objectives of each study were: Study I – Associations of functional HLA class I groups with HIV viral load in a heterogeneous cohort. To assess whether functional clustering of the main host genetic factors involved in HIV control, Human Leukocyte Antigen alleles, based on predicted binding affinities to HIV peptides facilitates the study of HLA alleles in demographically heterogeneous cohorts. Study II – Personalized survival probabilities for SARS-CoV-2 positive patients by explainable machine learning. To implement survival machine learning models for predicting personalized 12-week mortality of SARS-CoV-2 positive patients by leveraging electronic health records and describing temporal dynamics of relevant risk factors through model explainability.
UNIVERSITY OF COPENHAGEN
FACULTY OF HEALTH AND MEDICAL SCIENCES
Explainable Machine Learning for
Precision Medicine of
Patients with Infectious Diseases
PhD Thesis
Adrian Gabriel Zucco, MSc
This thesis has been submitted to the Graduate School of Health and Medical Sciences,
University of Copenhagen on the 5th of August 2022.
Strømmen fanger dig.
Svøm MED strømmen aldrig MOD strømmen,
som kan træke dig ned.
Når strømmen bliver svagere -
Svøm rundt i en bue tilbage mod land.
“The current catches you.
Swim WITH the current never AGAINST the current
as it may pull you down.
When the current becomes weaker,
Swim towards the shore in a curve.”
- Inspirational sign at a beach in Nationalpark Vadehavet
Studies
This thesis is based on two studies, referred to by the Roman numerals I and II:
I. Associations of functional HLA class I groups with HIV viral load in a
heterogeneous cohort [1].
Adrian G. Zucco, Marc Bennedbæk, Christina Ekenberg, Migle Gabrielaite,
Preston Leung, Mark N. Polizzotto, Virginia Kan, Daniel D. Murray, Jens D.
Lundgren and Cameron R. MacPherson for the INSIGHT START study group.
medRxiv, June 2022.
(Submitted to AIDS)
II. Personalized survival probabilities for SARS-CoV-2 positive patients by
explainable machine learning [2].
Adrian G. Zucco, Rudi Agius, Rebecka Svanberg, Kasper S. Moestrup, Ramtin Z.
Marandi, Cameron Ross MacPherson, Jens Lundgren, Sisse R. Ostrowski*,
Carsten U. Niemann*
*Co-senior authors.
medRxiv, Oct. 2021.
(Accepted at Scientific Reports)
Correspondence
Adrian Gabriel Zucco, MSc.
CHIP, Centre of Excellence for Health Immunity and Infections,
PERSIMUNE, Centre of Excellence, Rigshospitalet, University of Copenhagen, Denmark
Telephone: +45 35 45 57 75
e-mail: adrian.gabriel.zucco@regionh.dk
Financial support
This PhD has been supported by the Danish National Research Foundation (DNRF126) and
a COVID-19 grant from the Ministry of Higher Education and Science (0238-00006B) while
conducting research at CHIP (Centre of Excellence for Health, Immunity and Infections) and
PERSIMUNE (Centre of Excellence for personalized medicine of infectious complications in
immune deficiency).
Relevant scientific contributions
The research presented in this thesis has been an important contribution to the following
studies:
Association Between Single-Nucleotide Polymorphisms in HLA Alleles and Human
Immunodeficiency Virus Type 1 Viral Load in Demographically Diverse, Antiretroviral
Therapy-Naive Participants from the Strategic Timing of AntiRetroviral Treatment Trial [3].
Ekenberg, C., Tang, M.-H., Zucco, A. G., Murray, D. D., MacPherson, C. R., Hu, X.,
Sherman, B. T., Losso, M. H., Wood, R., Paredes, R., Molina, J.-M., Helleberg, M., Jina, N.,
Kityo, C. M., Florence, E., Polizzotto, M. N., Neaton, J. D., Lane, H. C., & Lundgren, J. D.
The Journal of Infectious Diseases, 2019, 220(8), 1325-1334.
The association of human leukocyte antigen alleles with clinical disease progression in
HIV-positive cohorts with varied treatment strategies [4].
Ekenberg, C., Reekie, J., Zucco, A. G., Murray, D. D., Sharma, S., Macpherson, C. R.,
Babiker, A., Kan, V., Lane, H. C., Neaton, J. D., Lundgren, J. D., & for the INSIGHT START
Study Group.
AIDS, 2021, 35(5), 783-789.
Human Immunotypes Impose Selection on Viral Genotypes Through Viral Epitope
Specificity [5].
Gabrielaite, M.*, Bennedbæk, M.*, Zucco, A. G., Ekenberg, C., Murray, D. D., Kan, V. L.,
Touloumi, G., Vandekerckhove, L., Turner, D., Neaton, J., Lane, H. C., Safo, S.,
Arenas-Pinto, A., Polizzotto, M. N., Günthard, H. F., Lundgren, J. D., Marvig, R. L., &
INSIGHT START Study Group.
*Contributed equally
The Journal of Infectious Diseases, 2021, 224(12), 2053-2063.
Readmissions, post-discharge mortality and sustained recovery among patients admitted to
hospital with COVID-19 [6].
Moestrup, K. S., Reekie, J., Zucco, A. G., Jensen, T. Ø., Jensen, J.-U. S., Wiese, L.,
Ostrowski, S. R., Niemann, C. U., MacPherson, C. R., Lundgren, J., & Helleberg, M.
Clinical Infectious Diseases, 2022 (In press)
Supervisors
Supervisor:
Co-supervisor:
Co-supervisor:
Assessment committee
Chair
Assessor
Assessor
Table of contents
1. PREFACE AND ACKNOWLEDGEMENTS
2. LIST OF ABBREVIATIONS
3. SUMMARY
3.1. ENGLISH SUMMARY
3.2. DANSK RESUMÉ
4. INTRODUCTION
4.1. THE NEW PARADIGM OF PRECISION MEDICINE
4.2. INFERENCE AND PREDICTION IN PRECISION MEDICINE
4.3. ARTIFICIAL INTELLIGENCE, MACHINE LEARNING AND DEEP LEARNING
4.4. TRANSPARENCY, INTERPRETABILITY AND EXPLAINABILITY OF MACHINE LEARNING MODELS
4.5. ARTIFICIAL INTELLIGENCE FOR PATIENTS WITH INFECTIOUS DISEASES
4.6. PREDICTIVE MODELS DURING THE SARS-COV-2 PANDEMIC
4.7. PRECISION MEDICINE OF PATIENTS WITH INFECTIOUS DISEASES
4.8. HOST GENOMICS IN HIV INFECTION
5. OBJECTIVES
6. METHODS
6.1. DATA SOURCES
6.1.1. THE STRATEGIC TIMING OF ANTIRETROVIRAL TREATMENT (START) COHORT
6.1.2. ELECTRONIC HEALTH RECORDS IN THE CONTEXT OF THE SARS-COV-2 PANDEMIC
6.2. ETHICAL CONSIDERATIONS
6.3. ENCODING ELECTRONIC HEALTH RECORDS FOR MACHINE LEARNING
6.4. BIOINFORMATICS APPROACHES FOR VIRAL GENOMICS
6.5. SUPERVISED MACHINE LEARNING
6.5.1. HUMAN LEUCOCYTE ANTIGEN IMPUTATION
6.5.2. BINDING AFFINITY PREDICTION OF HLA CLASS I ALLELES TO HIV PEPTIDES
6.5.3. SURVIVAL ANALYSIS BY DISCRETE-TIME MODELLING
6.6. UNSUPERVISED MACHINE LEARNING
6.6.1. CONSENSUS CLUSTERING
6.7. TRAINING AND ASSESSMENT OF MACHINE LEARNING MODELS
6.8. EXPLAINABILITY OF MACHINE LEARNING MODELS
6.9. STATISTICAL ANALYSES
6.10. SOFTWARE AND VISUALIZATION TOOLS
7. SUMMARY OF RESULTS
7.1. STUDY I
7.2. STUDY II
8. DISCUSSION
8.1. IMPUTATION AND FUNCTIONAL CLUSTERING OF HLA ALLELES BY MACHINE LEARNING MODELS
8.1.1. ASSOCIATIONS OF FUNCTIONAL HLA CLASS I CLUSTERS WITH HIV VIRAL LOAD
8.2. EXPLAINABLE MACHINE LEARNING FOR SURVIVAL MODELS IN PRECISION MEDICINE
8.2.1. RISK FACTORS IN SARS-COV-2+ PATIENTS THROUGH MODEL EXPLAINABILITY
8.3. STRENGTHS AND LIMITATIONS
9. CONCLUSION
10. FUTURE PERSPECTIVES
10.1. HOW TO MODEL: FROM PREDICTIVE TO CAUSAL MODELS
10.2. WHAT TO MODEL: DATA-DRIVEN MODELS AND COMPLEXITY
11. REFERENCES
12. MANUSCRIPTS
1. Preface and acknowledgements
Back in early 2018, my interest in applying what I had learned about Machine Learning to
understand our immune system and contribute to the fight against infectious diseases led me
to contact Jens Lundgren. He not only accepted me at CHIP and PERSIMUNE but has, since that
day, supported my endeavours with balanced scientific scrutiny, guiding me through the
nuances of the medical world and pushing me to become a better clinical researcher. For
doing so while always keeping a personal interaction, I am immensely grateful and inspired.
I would also like to express my gratitude to my co-supervisors. Thanks to Cameron MacPherson
for his support and understanding as a fellow bioinformatician and curious being. Also, thanks
to Ole Winther for his sharp advice and pragmatic guidance when engaging in stimulating
scientific discussions. Our shared interest in bringing Machine Learning to the clinic has been a
motivation. A special acknowledgement goes to Daniel Murray, whose support, help and
encouragement to write have been so pivotal that I consider him my unofficial co-supervisor.
This thesis would not have been possible without nurturing collaborations. Thanks to Christina
Ekenberg for her kindness and fruitful work together, Kasper Moestrup for always being there
side by side, Rudi Agius for shared moments and passion for our crafts. Thanks for the good
collaboration across departments with Rebecka Svanberg, Sisse Ostrowski, Carsten Niemann,
Marie Helleberg and Rasmus Marvig. Also, to the international collaborators from INSIGHT and
welcoming researchers in the translational immunology group at Institut Pasteur. My
appreciation also goes to the participants of the studies for their contribution.
I would like to say thanks to my fellow bioinformaticians Migle Gabrielaite, Mette Jørgensen,
Preston Leung, Ramtin Marandi, Kirstine Krøyer Rasmussen, Pernille Iversen and Man-Hung
Tang for their camaraderie and nice discussions. Also, to the fellow statisticians, Joanne Reekie,
Erich Tusch and Quenia dos Santos. Moreover, thanks to Jens Christian and Marc Bennedbæk
for contributing to the Ph. in my PhD.
A huge thanks to Lisbeth Jørgensen, Helle Bo Duus and Lisbeth Bille for making my life much
easier with their help. I would like to acknowledge the talented team at PERSIMUNE and CHIP
with whom I had the honour to interact over the years: Dorthe Raben, Alvaro Borges, Bastian
Neesgaard, Cynthia Terrones Campos, Emma Illett, Isabelle Lodding, Cornelia Crone, Sara
Mørup, Sebastian Moretto, Christian Jensen, Riia Sustarsic, Jamshed Gill, Olga Fursa, Lars
Peters, Nadine Jaschinski, Frederik Woldbye... The list continues but I would also like to give
kudos to the IT department and public health teams for their work, with a special mention to
Tina Bruun for her joy and Anne Raahauge for the interesting interdisciplinary discussions.
On a personal level, I would not have reached far without the company, fun moments, and
therapeutic conversations with my dear friends Angélica, Jakub, Zulema, Maria Luisa, Rocío,
my Sydhavn crew including Jon, my beloved biochemists, my high school brothers, and friends
Emma and Lucy.
Last but not least, I want to thank my wonderful partner Rasa, my parents (Danilo and
Miriam) and my family in Argentina, especially my grandmother and godmother Sandra, for
their love and support. My academic achievements are also theirs.
2. List of abbreviations
Abbreviation    Description
ACE2            Angiotensin converting enzyme 2
AI              Artificial Intelligence
AIDS            Acquired immunodeficiency syndrome
ART             Antiretroviral therapy
ATC             Anatomical Therapeutic Chemical
AUROC           Area under the receiver operating characteristic curve
CCR5            C-C chemokine receptor type 5
CD4             Cluster of differentiation 4
CI              Confidence interval
COVID-19        Coronavirus disease 2019
ESS             Error sum-of-squares
FPT             First positive SARS-CoV-2 test
HIV             Human Immunodeficiency Virus
HLA             Human Leucocyte Antigen
ICD-10          International Statistical Classification of Diseases and Related Health Problems version 10
ICU             Intensive care unit
IML             Interpretable Machine Learning
INSIGHT         International Network for Strategic Initiatives in Global HIV Trials
KIR             Killer immunoglobulin-like receptors
mAI             Medical Artificial Intelligence
MCC             Matthews correlation coefficient
MHC             Major histocompatibility complex
NLP             Natural language processing
PCA             Principal Component Analysis
PDP             Partial dependence plot
PLWH            People living with HIV
PR-AUC          Precision-recall area under the curve
RNA             Ribonucleic acid
RT-PCR          Real-Time Polymerase Chain Reaction
RWD             Real-world data
SARS-CoV-2      Severe acute respiratory syndrome coronavirus 2
SHAP            SHapley Additive exPlanations
SNP             Single Nucleotide Polymorphism
START           Strategic Timing of AntiRetroviral Treatment
VL              Viral load
WHO             World Health Organization
3. Summary
3.1. English summary
Precision medicine is developing as a new paradigm in healthcare. To achieve its goals,
refined characterizations of patients and personalized clinical models are necessary.
Progress has been made in these areas by the inference of relevant genomic factors and
the proposal of better predictive models for clinical outcomes. This has been possible
through the development of statistics and computer science, in particular Machine Learning,
providing the tools for modelling the complexity in precision medicine. Nevertheless, to move
beyond purely predictive models, and gain insights from Machine Learning we need to
consider approaches to explain such models. This is especially relevant in the context of
infectious diseases, where the identification of patients at risk can improve clinical outcomes
and help understand critical disease mechanisms. An example of this has been observed
during the HIV pandemic in which some host genetic factors, in particular the Human
Leucocyte Antigen (HLA), have been linked to variation in HIV viral load. However, studies
of the HLA region have been challenging in heterogeneous populations due to its diversity.
More recently, during the SARS-CoV-2 pandemic, healthcare systems have been strained,
highlighting the need for early and precise patient risk assessment. In this thesis, Machine
Learning solutions to key problems in precision medicine of patients with infectious diseases
are presented in the context of HIV and SARS-CoV-2 infections powered by model
explainability.
In Study I, we investigated associations of HLA alleles to HIV viral load (VL) in a genetically
diverse cohort of 2546 HIV+ participants from the Strategic Timing of AntiRetroviral
Treatment (START) study by imputing and accounting for functional relationships between
HLA alleles. Machine Learning methods were used to impute and cluster HLA alleles based
on their predicted binding affinities to HIV and unspecific peptides. We found four major
functional clusters representing 30 HLA alleles accounting for over 11% of the variability in
VL previously reported in homogeneous cohorts. Some of these alleles, while present in
distinct populations, shared a common function reflecting similar effects. These effects were
also found using unspecific peptides; hence, the proposed methodology could be used for the
study of HLA alleles in other diseases.
In Study II, we developed a Machine Learning model to predict mortality within 12 weeks of
a first positive SARS-CoV-2 test based on 33938 cases during the first year of the COVID-
19 pandemic in Denmark. By implementing a discrete-time modelling approach leveraging
electronic health records into models with high performance, we could predict and explain
personalized survival curves and temporal dynamics. Among the final 22 features, age, sex,
number of medications, previous hospitalizations and lymphocyte counts were the top risk
factors for mortality. Compared to previous models developed for COVID-19, we account for
censored patients and missing values while providing explainable predictions that are
patient-specific. The proposed methodology can be used in other clinical problems as a
framework for predictive models in precision medicine.
3.2. Dansk resumé (Danish summary)
Precision medicine is a new paradigm in healthcare that requires a detailed characterisation
of patients and clinical models tailored to the individual patient. Inference of relevant
genomic factors and better predictive models for clinical outcomes have been decisive for the
progress achieved. This has been possible through the development of statistical and
computer science methods, in particular machine learning, which make it possible to model
the complexity of the data. To move beyond purely predictive models, we need to develop
methods to interpret the models and their predictions. This is especially relevant for
infectious diseases, where the identification of patients in risk groups can improve clinical
outcomes and help understand critical disease mechanisms. An example of this was observed
during the HIV pandemic, in which some host genetic factors, in particular the Human
Leucocyte Antigen (HLA), have been linked to variation in HIV viral load. However, studies
of the HLA region have been challenging in heterogeneous populations. During the SARS-CoV-2
pandemic, healthcare systems have been overburdened, which highlights the need for early and
precise patient risk assessment. This PhD thesis presents machine learning solutions for
precision medicine of patients with HIV and SARS-CoV-2 infections, driven by model
explainability.
In Study I, we investigated associations between HLA alleles and HIV viral load (VL) in a
genetically diverse cohort of 2546 HIV+ patients from the Strategic Timing of AntiRetroviral
Treatment (START) study by imputing and accounting for functional relationships between HLA
alleles. Machine learning methods were used to impute and group HLA alleles based on their
predicted binding affinities to HIV and unspecific peptides. We found four major functional
clusters representing 30 HLA alleles that account for over 11% of the variability in VL
previously reported in homogeneous cohorts. Some of these alleles, present in different
populations, shared a common function reflecting similar effects. These effects were also
found using unspecific peptides, and the proposed method could therefore be used to study
HLA alleles in other diseases.
In Study II, we developed a machine learning model to predict mortality within 12 weeks of
a first positive SARS-CoV-2 test based on 33938 cases during the first year of the COVID-19
pandemic in Denmark. By implementing a discrete-time modelling approach that leverages
electronic health records in models with high accuracy, we could predict and explain
personalised survival curves and temporal dynamics. Of the final 22 features, age, sex,
medication use, previous hospitalisations and lymphocyte counts were the largest risk factors
for mortality. Compared to previous models developed for COVID-19, we account for censored
patients and missing values while providing interpretable predictions that are
patient-specific. The proposed method is general and can therefore be applied to other
predictive problems in precision medicine.
4. Introduction
Concepts such as precision medicine, artificial intelligence, machine learning or model
explainability are populating the medical literature. Understanding what these terms mean is
critical to establishing a basis of knowledge in which these concepts can be later actualised.
In this chapter, definitions of these important topics will be expanded and presented in the
context of patients with infectious diseases, in particular, those affected by the Human
Immunodeficiency Virus (HIV) and the Severe Acute Respiratory Syndrome Coronavirus 2
(SARS-CoV-2).
4.1. The new paradigm of precision medicine
In recent years, “precision medicine” has become a new focus in clinical research and
practice. This concept has gained favour over the term “personalised medicine” under the
premise that physicians have always aimed at providing individual and personalized
treatments informed by a personal relationship with their patients [7]. While the two terms are
often used interchangeably, precision medicine emphasizes the need for incorporating a deeper
characterisation of patients to tailor disease prevention and treatment. Although a clear
definition is still debated, precision medicine can be understood as an iterative process of
accurate patient stratification up to the individual level through the development of clinically
relevant models that incorporate genomic, clinical, lifestyle and environmental information [8].
Key to the development of precision medicine are the recent efforts in the digitalization of
healthcare systems and the reduced costs of deep phenotyping of patients through
high-throughput approaches [7]. The progressive implementation of electronic health records (EHR)
has allowed the widespread gathering of data from populations around the world, influencing
clinical decision-making [9]. The availability of this vast amount of information has not only
increased the quality and quantity of data available for research but also enabled
disease surveillance through national databases. The availability of affordable
high-throughput techniques has facilitated the collection of detailed biological information at
the individual level. The most relevant case can be observed in the advent of array
technologies since the early 2000s [10]. These methods have allowed the study of genomic
variation to the level of single nucleotide polymorphisms (SNPs), powering a boom in
genome-wide association studies (GWAS). The growing interest in such technologies has catalysed
the launch of consumer-based platforms and national initiatives to genotype whole
populations. A clear example of this can be observed in the Nordic countries, where a
combination of robust health registries, universal healthcare and investment to collect
genotypes and genomes of their citizens provides fertile ground for precision medicine [11].
Despite the progress made in terms of generating high-quality data, concerns arise on how
to turn the rich and complex clinical information gathered into models of disease prevention
and treatment adapted to the individual. This can be partially attributed to outdated
epidemiological and statistical methods that focus on the average person [12]. While these
methods were useful for reaching the current state of scientific knowledge, new methodologies
are needed to model the complexity present in medicine and healthcare [13].
4.2. Inference and prediction in precision medicine
Since the scientific revolution in the 16th century, the method of observation, hypothesis
testing through experimentation and proposal of scientific models has been enriched by
mathematical applications. This evolution gave birth to modern statistics in the late 19th
century, coupled with the ever-growing data collection and techniques for measuring relevant
metrics facilitated by the industrial revolution. The simplest expression of this field can be
seen in descriptive approaches to summarize data or represent data structures [14],
corresponding to the most general understanding of statistics. Statistical learning or
statistical modelling builds on and expands this basic understanding to capture relationships
between measured and future observations. It can be represented as a set of approaches
for estimating a function $f$ that links a set of observable variables $X$ to some response
$Y$ in the form:
$$Y = f(X) + \varepsilon$$
where $\varepsilon$ corresponds to a random error term, independent of $X$. Estimating $f$ is
needed for inference and prediction. When performing inference, the main goal is to understand
the relationships between the variables $X$ and the response $Y$ or, in other words, how $Y$
changes as a function of $X$. Through inference we can answer questions about associations,
relevant variables or the relationship of each of them to the response of interest [15]. For
prediction tasks, given a new set of values for the input variables $X$, the goal is to
accurately predict new responses $Y$, given that the irreducible error term $\varepsilon$ is
low enough [15]. In any case, how we estimate $f$ depends on the task of interest. For
inference, we need to know the exact form of the estimate $\hat{f}$ to define the relationships
between $X$ and $Y$, whereas for prediction tasks $\hat{f}$ does not have to be explicitly
defined as long as it provides accurate predictions for $Y$. This type of reasoning
corresponds to cases in which responses or outcomes have been measured, in a supervised
learning setting. Alternatively, relationships between the variables can be explored without a
known response or outcome in an unsupervised manner.
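The distinction between inference and prediction can be illustrated with a minimal sketch. It assumes a simulated dataset in which the true $f$ is linear (an assumption made only for this illustration, not data from the thesis) and estimates $\hat{f}$ by ordinary least squares. The same fitted model is then used in two ways: its slope is read for inference, and its output for a new input is used for prediction.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = b0 + b1 * x (closed-form solution)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return b0, b1

# Simulated data following Y = f(X) + eps, with f(x) = 2 + 0.5 * x.
# The form of f and the noise level are illustrative assumptions.
rng = random.Random(0)
xs = [rng.uniform(0, 10) for _ in range(200)]
ys = [2.0 + 0.5 * x + rng.gauss(0, 1) for x in xs]

b0, b1 = fit_line(xs, ys)

# Inference: the form of f-hat is known, so the slope can be interpreted
# as the expected change in Y per unit change in X.
print(f"estimated intercept: {b0:.2f}, estimated slope: {b1:.2f}")

# Prediction: f-hat is only used to produce a response for a new input.
x_new = 4.0
y_pred = b0 + b1 * x_new
print(f"predicted response at x = {x_new}: {y_pred:.2f}")
```

Even with a perfectly specified model, the predictions remain uncertain by at least the variance of the error term, which is why it is called irreducible.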
Traditionally, inference has been prioritised in the scientific literature over prediction. This
has led to a predominance of data models, in which a known stochastic data model is proposed
to estimate $f$ and thereby define the relationship between variables and responses of
interest. Examples of this can be seen in linear regression, logistic regression or Cox models,
where goodness-of-fit tests and residual analyses are carried out to assess how well the model
fits the data [16]. This approach proved to be efficient for estimating the parameters of the
proposed models through direct mathematical calculation. Later on, with the development of
computer science, the algorithmic modelling culture gained relevance. With this approach, an
algorithm is applied on $X$ to predict $Y$; the form of $f$ is not defined in advance but is
learned from the data. In this case, the model parameters are estimated by an
optimization process denominated training and then assessed by their predictive accuracy
on data excluded during the model generation [16]. Examples of such algorithms are
neural networks and decision trees.
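A minimal sketch of the algorithmic modelling workflow described above is given here, assuming a simulated binary outcome and the simplest possible tree, a one-split decision stump. The data, the 300/100 split and the function names are hypothetical choices for illustration. The split point is learned by minimising training errors and the model is then assessed only on held-out observations.

```python
import random

def train_stump(xs, ys):
    """Learn the threshold on x that minimises misclassifications in training."""
    best_t, best_err = None, len(xs) + 1
    for t in sorted(set(xs)):
        err = sum((x >= t) != y for x, y in zip(xs, ys))
        if err < best_err:
            best_t, best_err = t, err
    return best_t

def accuracy(threshold, xs, ys):
    """Fraction of observations for which the stump predicts the outcome."""
    return sum((x >= threshold) == y for x, y in zip(xs, ys)) / len(xs)

# Simulated data: the outcome is True when x > 5, with 10% of labels flipped.
# This generating rule is an assumption made for the example.
rng = random.Random(1)
xs = [rng.uniform(0, 10) for _ in range(400)]
ys = [(x > 5) != (rng.random() < 0.1) for x in xs]

# Exclude part of the data from training so it can be used for assessment.
train_x, test_x = xs[:300], xs[300:]
train_y, test_y = ys[:300], ys[300:]

threshold = train_stump(train_x, train_y)
held_out_acc = accuracy(threshold, test_x, test_y)
print(f"learned threshold: {threshold:.2f}, held-out accuracy: {held_out_acc:.2f}")
```

No distributional form is assumed for the relationship between x and the outcome; the quality of the model is judged solely by its accuracy on data it has not seen.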
In the context of precision medicine, these foundational notions of statistical learning have a
critical impact on the understanding of findings in the field and its objectives. The preference
to perform inference through data models requires very strong and restrictive assumptions to
define $f$ and describe the relationship between the input variables $X$ and the response $Y$.
Among these assumptions, the most common ones are that such relationships are additive
and linear [15]. While useful as approximations, such assumptions have proved problematic for
the understanding of biological systems, for example in genetic studies, where
alternatives such as multiplicative models have been proposed [17]. Furthermore, the validity of
such data models and assumptions is conditional on their fit to the data, which is rarely
assessed or tested [16]. Erroneous conclusions can be drawn based on coefficients of such
models using a significance level of 5% without proper testing of assumptions and model
fit [18]. This is especially relevant when analysing high-throughput data, where multiple testing
corrections are required [19]. Even if data models are correctly implemented, they are
commonly used to test associations on observational data, implying a particular type of causal
reasoning different from the causal assumptions used in an interventional setting such as
randomised clinical trials. Due to these concerns, statisticians have argued that predictive
models can help overcome some of these limitations by relaxing assumptions and focusing
on reproducibility [14]. A popularisation of predictive modelling is being fuelled by new advances
in computer science, algorithms, big data approaches and accessibility of affordable
computing.
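To make the multiple testing concern concrete, the following sketch compares an uncorrected 5% threshold with two standard corrections, Bonferroni and Benjamini-Hochberg. The p-values are made up for the example, and the thesis does not state which correction it applied, so this should be read only as an illustration of the general idea.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject a hypothesis only if its p-value is at most alpha / m."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]

def benjamini_hochberg(p_values, alpha=0.05):
    """Reject the k smallest p-values, where k is the largest rank
    satisfying p_(k) <= (k / m) * alpha (controls the false discovery rate)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * alpha:
            k_max = rank
    rejected = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= k_max:
            rejected[i] = True
    return rejected

# Hypothetical p-values from six association tests.
p_values = [0.001, 0.008, 0.012, 0.041, 0.20, 0.74]

naive = [p <= 0.05 for p in p_values]
print("uncorrected:", sum(naive))                                # 4 rejections
print("Benjamini-Hochberg:", sum(benjamini_hochberg(p_values)))  # 3 rejections
print("Bonferroni:", sum(bonferroni(p_values)))                  # 2 rejections
```

The number of associations declared significant shrinks as the correction becomes stricter, which matters most in high-throughput settings where thousands of variants are tested at once.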
4.3. Artificial intelligence, Machine Learning and Deep Learning
Technological advancements powered by research in mathematics, framed under new
philosophical considerations, have fostered the idea of artefacts or machines that could
perform human tasks. Since the invention of the first mechanical computer in the early 19th
century and the later proposal of the modern computer in 1937 [20], access to computers and
their capabilities have improved, becoming widespread in our society. Although coined in the
50s, the concept of Artificial Intelligence has become popular in the last decade as a goal to
materialize the idea of technology aiding humans through automatization.
Artificial intelligence (AI) can be understood as the capabilities of machines to enact
knowledge encoded in formal language using logical inference rules [21]. Initially, this
knowledge was explicitly encoded by humans in a “knowledge-based” approach. An
example of this is the usage of “if-else” statements as a basic form of AI. Such tasks, while
formally simple to define, are more challenging for humans to perform than for a
computer. Despite being successful in many domains including medicine [22], these systems
encounter limitations in problems such as image and speech recognition, which are complex
to formalize in simple rules but intuitive for humans. To overcome this, Machine Learning
(ML) was proposed as a branch of AI to emphasize the ability of computational systems to
acquire knowledge. In this way, the focus shifted to developing algorithms that could learn
complex rules from data through error minimization. The main challenges in ML are how to
encode data into numerical vectors so that such algorithms can learn meaningful representations
and how to define formal objectives to optimize according to the task at hand. The
importance of representation and abstractions learned by computational systems as a
successful approach for complex (but intuitive) tasks such as speech recognition and image
classification led to the development of Deep Learning [23]. As a sub-field of Machine Learning,
Deep Learning (DL) arose from the popularization of Artificial Neural Networks (ANNs),
algorithms inspired by mathematical models of biological neurons. After the popularisation in
the mid-80s of backpropagation for training the parameters of such networks by gradient
descent [24], these networks could accommodate more layers of neurons, becoming “deeper” in
their structure, hence the name. Nowadays, AI is often conflated with ML or DL; hence, it is
necessary to note the hierarchical ontology of these concepts: DL is a sub-field of ML, which
in turn is a branch of AI.
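The two challenges named above, encoding data into numerical vectors and defining a formal objective, can be sketched as follows. The record fields (`age`, `sex`, `lymphocytes`), the missing-value indicator and the choice of binary cross-entropy as the objective are hypothetical choices made for illustration and are not taken from the thesis.

```python
import math

def encode_patient(record, sex_levels=("female", "male")):
    """Encode a patient record (dict) as a numerical feature vector.

    Layout: [age, one-hot sex (one slot per level), lymphocyte count,
    indicator that the lymphocyte count is missing].
    """
    vector = [float(record["age"])]
    # One-hot encoding of a categorical variable.
    vector += [1.0 if record["sex"] == level else 0.0 for level in sex_levels]
    # A missing laboratory value is replaced by 0 and flagged explicitly.
    lymph = record.get("lymphocytes")
    vector.append(0.0 if lymph is None else float(lymph))
    vector.append(1.0 if lymph is None else 0.0)
    return vector

def log_loss(y_true, p_pred):
    """Binary cross-entropy: the objective a classifier would minimise."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)

patient = {"age": 63, "sex": "male", "lymphocytes": None}
features = encode_patient(patient)
print(features)  # [63.0, 0.0, 1.0, 0.0, 1.0]

# An uninformative model that always predicts 0.5 has a loss of log(2).
loss = log_loss([1, 0], [0.5, 0.5])
print(round(loss, 4))  # 0.6931
```

Training then amounts to adjusting the model parameters so that the objective decreases on the encoded training vectors.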
Prediction in the context of ML has been denominated “supervised learning”. In this setting,
a set of input variables named “features” are used for training models to predict a known
response or output. Depending on whether the output value is quantitative or qualitative,
ML can be used to perform regression or classification, respectively [25]. The input features
can be of any type as long as they can be encoded numerically into feature vectors.
Nevertheless, special considerations are taken when dealing with temporal data, which can
be framed in the context of forecasting, time series or survival analysis [26]. If an output value
is not available or prediction is not required, ML can also be used to learn representations,
patterns and clusters from data in what is called unsupervised learning. This approach has
been argued to be similar to how humans and animals learn [23]. However, while successful
at prediction and representation learning, inference through ML has been complicated by
the complexity of the models generated. The lack of transparency arising from such complexity
has been described as a “black box”. To overcome this, new concepts and computational
approaches have been proposed to open the black box.
4.4. Transparency, interpretability and explainability of machine learning
models
Understanding how predictions of machine learning models are generated is critical for
scientific inquiry27. The importance of such understanding is even codified into law as these
models infuse other areas of society28. However, the challenges of generating explanations
are not only applicable to ML but an important epistemological problem in other areas of
knowledge29. In the last decade, the field of explainable AI (xAI) or interpretable machine
learning (IML) has gained popularity but is still undergoing consolidation30,31. Due to
this, diverse ontologies have been proposed to distinguish between approaches to model
explainability. A useful framework for navigating xAI can be laid out by distinguishing between
transparency, interpretability and explainability of ML models27.
Transparency refers to the understanding of the mechanisms by which a model works30. This
understanding can be of the entire model to the point that it could be simulated by a human,
the ability to decompose a model in its parts or to describe the whole process of a model to
generate an output31. As pointed out in 4.2, this level of understanding is a key aspect of
statistical inference and examples of such approaches can be seen in linear regression,
logistic regression, decision trees or principal component analysis (PCA). More sophisticated
approaches for transparent models have been proposed in the context of ML by the use of
symbolic regression32. In the context of medicine, risk scores that can be calculated by
clinicians are examples of transparent models.
Models with higher predictive performance tend to be more opaque since their complexity can
potentially accommodate the complexity of the data. To illuminate this, model interpretability
aims to present properties of an ML model in terms of human understanding27, i.e., mapping
an abstract concept (model prediction) into a domain that a human can make sense of33.
Approaches to ML interpretability can be model-specific, such as feature importance
methods for tree-based models, or model-agnostic (post hoc). Examples of such post hoc
approaches include creating proxy linear models locally34,
using Shapley values from game theory through SHAP values35 or the mathematical
decomposition of neural networks by layer-wise relevance propagation (LRP)33.
Explainability of ML models is sometimes used interchangeably with interpretability, but some
authors have highlighted the differences29,33. Explainability emphasizes the consideration of
contextual information from domain knowledge related to the analysis goal to generate
explanations based on model interpretations which cannot be achieved only algorithmically27.
While model interpretability is purely descriptive of the model outputs in relation to their
inputs, model explainability accounts for causal notions by the user to explain not only how
but why the model could have provided a decision based on domain knowledge. Hence
explanations in the context of ML are subject to the same challenges as in other areas of
science which in this case make it dependent on the human-agent interaction29. Examples
of such challenges are limitations in the explanations due to current domain knowledge or
bias in selecting explanations among multiple possible ones. Nevertheless, generating
explanations based on accurate predictive models can potentiate scientific discoveries by
unravelling the complexity in data through Machine Learning.
In high-stake fields such as healthcare, the need for explanations of algorithms for clinical
decision-making is debated. Pragmatist positions hold that a well-performing model
should be used, as is done in other areas of medicine where certain drugs or interventions
have been used without full knowledge of the biological or clinical mechanisms behind them36,37.
An argument for the need for xAI in healthcare is based on building trust in clinicians and
patients regarding the use of algorithms. We tend to rely more on human medical decisions
even in contexts where AI outperforms humans, but interventions have been proposed to
bridge the gap and encourage the use of algorithms38. Nevertheless, ML models can be
subject to biases39 or problems that require further inspection and scepticism about their
predictions for which not only model explainability can help but also good practices during
model development and implementation.
4.5. Artificial intelligence for patients with infectious diseases
AI applied to medicine has been increasingly improving healthcare over the years. Early
approaches from the 1970s used rule-based methods for diagnosis or signal
processing22. In recent years, the focus has been on Deep Learning applications for
processing images, following the promising results in image classification observed since 201222.
Examples of these implementations can be seen in radiology where algorithms are tested to
process X-ray images and applications for other types of medical scans such as computed
tomography or imaging in pathology, dermatology, ophthalmology, gastroenterology or
cardiology. Despite such models being the least transparent, their results can be visually
assessed against experts where performance between clinicians and algorithms can be
compared37.
Other areas of medicine in which ML can offer solutions are in the context of health systems.
The increasing amount of health records that are digitalised allows for training ML models to
perform diverse tasks and predict clinical outcomes such as readmissions, mortality,
diagnosis or patient monitoring22. Most of these models have been developed and assessed
on retrospective data but have not been evaluated in clinical settings. This is important to
notice since some methodological limitations can prevent the success of these technologies.
Some of these challenges are the lack of predictions at the individual level for precise risk
assessment37 or certain biases when ignoring censored patients40 in medical AI (mAI).
In the context of infectious diseases, diverse solutions have been proposed based on
different data sources. Examples of these solutions are the development of models to detect
tuberculosis in X-rays41, diagnosis of infectious diseases based on structured data from
medical records42, early prediction of sepsis utilizing Natural Language Processing (NLP) in
unstructured data from clinical notes43 or predicting the risk of infection in patients with
immune dysfunction44. Most of the recent applications of ML for patients with infectious
diseases have been in the context of the SARS-CoV-2 pandemic where the crisis opened
the opportunity for technological innovation45.
4.6. Predictive models during the SARS-CoV-2 pandemic
The severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) is a single-stranded
RNA virus that causes coronavirus disease 2019 (COVID-19). Since its outbreak COVID-19
has resulted in over six million reported deaths worldwide46. SARS-CoV-2 binds to the
angiotensin-converting enzyme 2 (ACE2) which is highly expressed in vascular endothelial
cells of the lungs47 making it a selective respiratory virus. It has been reported that up to 40%
of cases at the time of testing were asymptomatic or presymptomatic48. The most common
symptoms are cough, fever, fatigue, headache, sore throat and loss of smell49. Based on
estimates from Europe in 202150, approximately 6% of infected individuals would progress
into severe disease requiring hospitalization. Furthermore, 14.2% of hospitalised patients
would be admitted to the Intensive Care Unit (ICU) or require supportive oxygen therapy.
The overall mortality rate among COVID-19 cases was 1.5%. These numbers contrast with
those from Denmark in 2020, where a 20% hospitalization rate, a 2.8% ICU admission rate
and a 5.2% fatality rate within 30 days were reported based on
patients that were positive for SARS-CoV-251. The differences in these rates capture the
evolving nature of the pandemic, in which not only new variants arising from mutations in the
virus were observed but also the impact of public health interventions such as vaccination and new
treatments. While these changes seem to point to better management of the pandemic in
terms of mortality, the fast-paced nature of such changes has challenged predictive models.
Since the start of the SARS-CoV-2 pandemic, at least 180 ML approaches have been
proposed to tackle various clinical challenges52. These include solutions from a population
level such as forecasting dynamics of the disease, predicting the impact of interventions or
outbreak detection to an individual level by modelling diagnosis, triage, or disease prognosis.
These models employ diverse data sources, including images e.g., radiography53,
unstructured text from clinicians or epidemiological data. To develop such models a
considerable amount of data is needed. However, due to the urgency of the pandemic, faster
solutions were required. Leveraging existing structured data in the form of EHR made it possible
not only to quickly access vast amounts of relevant clinical features54 but also to pave the way for
implementing ML models in the healthcare systems. In particular, EHR can be used to
develop prognosis prediction and multiple models based on traditional statistical frameworks
or ML methods have been reported. Nevertheless, most of these models were at high risk of
bias or poorly reported55,56. While some of these issues are due to the quality and
quantity of the available data or the pre-selection of variables based on limited knowledge of
a new disease, some problems can be attributed to methodological assumptions and the
implementation of ML algorithms. The main methodological shortcomings of such studies
are the lack of explainability of the ML models and failure to consider censored individuals40,57
without relying on strong assumptions such as proportional hazards when performing
survival analysis26. Solutions to these issues are presented in Study II and explained in the
following chapters of this thesis (6.5.3 and 6.8).
4.7. Precision medicine of patients with infectious diseases
The areas of major interest in precision medicine have been in the context of prenatal
diagnostics, genetic diseases and oncology58. National initiatives have even emphasized
some of these areas such as oncology as a short-term focus to materialize the benefits of
precision medicine59. Nevertheless, despite this focus and scientific breakthroughs in the
area for the past 20 years, the implementation of these advancements in the clinic has
proved insufficient60.
Often overlooked, one area where precision medicine is having a great impact is in the
context of infectious diseases. Due to the early development of microbiology among the
biological sciences, the knowledge of the molecular mechanisms of pathogens coupled with
the increasing access to high-throughput techniques in the clinic has allowed the
development of better diagnostic tests61. An example of this has been not only the rapid
diagnosis of individuals infected by the SARS-CoV-2 virus but also the specific variant
involved62. For other pathogens, such as bacteria, a rapid diagnosis and specific
characterisation can inform treatment and reduce the risk of antibiotic resistance63. Rapid
and accurate pathogen characterisation and diagnosis is the initial step in which precision
medicine has impacted care. Once a pathogen has entered an organism, assessing the
disease progression is also necessary. This is especially relevant in the case of sepsis, one
of the major causes of death from any infectious disease, where the combination of quick
diagnosis and assessment of biomarkers is critical64.
While precision medicine maximizes its efforts in treating individuals, the development in the
field is also applicable at a population level. One important area of research has been the
understanding of genomic factors from hosts and their interplay with pathogens. In this
sense, the use of genomic approaches has proved increasingly beneficial in the context of
epidemiology65 and the widespread popularisation of GWAS has also expanded our
knowledge of infectious diseases66. Such genetic data to study infections has not only been
collected in the context of research studies but also through the analysis of consumer-based
genomic tests67. A particular disease for which global efforts have been put to find treatments
and understand the mechanisms of interaction with the host is the Human Immunodeficiency
Virus type 1 (HIV).
4.8. Host genomics in HIV infection
The Human Immunodeficiency Virus (HIV) type 1 is a single-stranded RNA retrovirus first
isolated in 198368 that has so far claimed 36.3 million [27.2–47.8 million] lives and by the end
of 2020, 37.7 million [30.2–45.1 million] people living with HIV (PLWH) were reported
according to the World Health Organization (WHO)69. HIV is transmitted via the exchange of
body fluids that can occur during sex, pregnancy, childbirth, or blood contact with infected
materials such as needles. Once in the organism, HIV targets mainly CD4+ T lymphocytes,
a type of immune cell, by binding to CD4 molecules and a co-receptor (CCR5 or CXCR4)70.
Primary HIV infection is characterised by an initial spike in HIV virions in the blood, measured
in terms of viral load (HIV-VL). However, due to a high mutation rate of the virus, HIV
eventually overcomes the immune response after a period of stability. This results in an
increase in HIV-VL and depletion of CD4+ T cell lymphocytes70. Due to these dynamics, the
levels of CD4+ T cells and HIV viral load (VL) measured in blood have been used as markers
of disease progression. Over time, the reduced levels of CD4+ T cells can result in acquired
immunodeficiency syndrome (AIDS) which can ultimately lead to death by opportunistic
infections when untreated70. Antiretroviral therapy (ART) has proved effective in suppressing
viral replication for nearly two decades. Although the available therapies cannot cure the
infection, early detection and sustained treatment71 can control the virus to
undetectable levels, which equates to being untransmittable72.
Most of the achievements to fight the disease were accomplished by intensive research in
the lab and the clinic combined with public health initiatives. Over the years, increasing
interest arose in understanding the ability of individuals to control HIV, the basis for which
could be found in genetic markers. Multiple GWAS have been conducted to assess
diverse HIV-related outcomes with consistent findings of associations in the genes encoding
for the C-C chemokine receptor type 5 (CCR5) and the Human Leucocyte Antigen (HLA)73.
In particular, HLA alleles have been reported to explain up to 12% of the variability in HIV
VL74 by imposing pressure on viral genetic diversity, which by itself can explain 20%-46%
of the variability in VL75. A common problem of GWAS is the lack of diversity in the
cohorts, which are predominantly of European ancestry76. This is relevant in the context of HIV,
where the global pandemic has severely impacted Africa and confounding effects could bias
the results of HIV studies due to population stratification and regional differences. In recent
years new studies have confirmed some of the previous findings in more diverse populations
not only at the host level3 but also reported interactions with viral genomes5. Nevertheless,
due to the high variability of the HLA region, the most variable region in the human genome77,
large sample sizes are required to study its effects. Although some attempts have been
made to increase the diversity of the cohorts78, we proposed considering HLA
function to group similar alleles, addressing these issues by enriching the cohorts with useful information.
This approach aided by ML is presented in Study I.
5. Objectives
This thesis aims to present Machine Learning applications for the development of precision
medicine in patients with infectious diseases. This is outlined by proposing computational
solutions to two major challenges in precision medicine: how to infer relevant host genetic
factors in heterogeneous populations (Study I) and how to predict patient-specific risk while
accounting for censored individuals (Study II). For both challenges, the implemented models
are explained based on domain knowledge of biological systems and disease aetiology
supported by methods of model interpretability. This corresponds to the secondary aim of
developing not only predictive models but also deepening the understanding of HIV host
genomics and SARS-CoV-2 risk factors respectively. The specific objectives of each study
were:
Study I – Associations of functional HLA class I groups with HIV viral load in a heterogeneous
cohort.
To assess if functional clustering of the main host genetic factors involved in HIV control,
Human Leukocyte Antigen alleles, based on predicted binding affinities to HIV peptides
facilitates the study of HLA alleles in demographically heterogeneous cohorts.
Study II – Personalized survival probabilities for SARS-CoV-2 positive patients by explainable
machine learning.
To implement survival machine learning models for predicting personalized 12-week
mortality of SARS-CoV-2 positive patients by leveraging electronic health records and
describing temporal dynamics of relevant risk factors through model explainability.
6. Methods
6.1. Data sources
6.1.1. The Strategic Timing of AntiRetroviral Treatment (START) cohort
Study I is based on an international cohort from the Strategic Timing of AntiRetroviral
Treatment (START) clinical trial run by the International Network for Strategic Initiatives in
Global HIV Trials (INSIGHT). This cohort consists of asymptomatic HIV+ and ART-naïve
individuals with two CD4+ cell counts >500/μL at least 14 days apart within 60 days of
enrolment in the trial71. The objective of the trial was to determine if early initiation of
antiretroviral therapy improved outcomes in asymptomatic HIV+ individuals while thoroughly
characterising such populations across the world. A total of 4685 patients were enrolled from
35 countries across 5 continents between 2009 and 2013. Genotypic and baseline data
were available for 2546 patients using Affymetrix Axiom SNP array3. Furthermore, viral HIV
RNA paired-end sequences were generated from 2079 patients.
6.1.2. Electronic Health Records in the context of the SARS-CoV-2 pandemic
In Study II, access to raw electronic health records from the Capital Region and Region
Zealand (eastern Denmark) was granted and supported by the Danish Ministry of Higher
Education and Science in the framework of the COVIMUN project. The EHR were provided
as data extracts from the electronic patient journal (EPJ) by EPIC systems. The final dataset
was composed of observations up to the 2nd of March 2021 with historical data available up
to at least 3 years before. 963,265 individuals over 18 years old were identified with a Real-
Time Polymerase Chain Reaction (RT-PCR) SARS-CoV-2 test taken at test sites connected
to the EHR system between the 17th of March 2020 and the 2nd of March 2021. The data
available contained demographics, hospitalizations, vital parameters, laboratory test results,
diagnoses and medicines (ordered and administered) used for routine care by clinicians.
Figure 1. Consort diagram of the SARS-CoV-2 cohort.
6.2. Ethical considerations
In Study I, access to the data in the START clinical trial (NCT00867048)71 was granted by
the International Network for Strategic Initiatives in Global HIV Trials (INSIGHT). Written
consent for the study and genetic analyses was obtained from the participants and
approved by participants’ site ethics review committees during the trial.
In Study II, approval to access EHR was provided by the Danish Regional Ethical Committee
in the Capital Region (H-20026502) and Data Protection Agency (P-2020-426) ensuring
compliance with the required ethical and legal regulations. Under Danish law,
informed consent from patients can be waived given that approval from the Ethical
Committee is obtained before access to EHR for research purposes.
6.3. Encoding electronic health records for Machine Learning
To process biological and clinical information with an ML algorithm, data needs to be
encoded in meaningful feature vectors and clinical outcomes according to the prediction
task. In Study II, longitudinal data from raw EHR was used to engineer features and predict
the risk of death 12 weeks after a first positive SARS-CoV-2 test (FPT).
Multiple time windows and summary statistics can be used for feature engineering. We opted
for a simple approach to facilitate the interpretation of the final feature set. We encoded basic
characteristics such as age, sex, and body mass index (BMI) as the latest value observed
up to the day of FPT. For continuous values as in the case of vital parameters (e.g., systolic
blood pressure) and laboratory test results (e.g., blood cell counts), we considered the latest
value observed in the last month before the FPT. For categorical variables such as
diagnoses, in the form of International Statistical Classification of Diseases and Related
Health Problems version 10 codes (ICD-10), or medications, as Anatomical Therapeutic
Chemical (ATC) codes, we used the counts of such codes in the last three years and one
year respectively. Some extra features were added such as the number of weeks since the
start of the pandemic until the FPT was taken and an indicator if the patient was hospitalized
when the FPT was performed. Previous hospitalisations were included as a variable for
hospital stays longer than 24h encoded as cumulative days in hospital within the last three
years. Missingness was considered not to be at random, hence imputation was not performed. For
diagnoses and medications, the lack of a code was encoded as zero and for continuous
variables such as laboratory values and vitals, missingness was codified as missing values.
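The encoding rules above can be sketched with pandas as follows. This is a minimal illustration under stated assumptions, not the pipeline used in Study II: the table layouts, the column names (`patient_id`, `date`, `value`, `code`, `fpt_date`), the helper functions and the toy records are all hypothetical.

```python
import pandas as pd

# Hypothetical toy data: one row per patient with the date of the first positive test (FPT).
fpt = pd.DataFrame({
    "patient_id": [1, 2],
    "fpt_date": pd.to_datetime(["2020-11-10", "2020-12-01"]),
})

# Hypothetical long-format laboratory results (continuous values).
labs = pd.DataFrame({
    "patient_id": [1, 1, 1],
    "date": pd.to_datetime(["2020-09-01", "2020-10-20", "2020-11-05"]),
    "value": [5.1, 6.0, 7.2],
})

# Hypothetical long-format diagnoses (ICD-10 codes).
diagnoses = pd.DataFrame({
    "patient_id": [1, 1, 2, 2],
    "date": pd.to_datetime(["2016-01-01", "2019-05-01", "2019-03-01", "2020-06-15"]),
    "code": ["I10", "I10", "E11", "E11"],
})


def latest_value(events, fpt, window_days, name):
    """Latest observed value within `window_days` up to the FPT; NaN if none."""
    merged = events.merge(fpt, on="patient_id")
    delta = (merged["fpt_date"] - merged["date"]).dt.days
    in_window = merged[(delta >= 0) & (delta <= window_days)]
    latest = in_window.sort_values("date").groupby("patient_id")["value"].last()
    return latest.reindex(fpt["patient_id"]).rename(name)


def code_counts(events, fpt, window_days):
    """Counts of each code within the window before the FPT; absent codes become 0."""
    merged = events.merge(fpt, on="patient_id")
    delta = (merged["fpt_date"] - merged["date"]).dt.days
    in_window = merged[(delta >= 0) & (delta <= window_days)]
    counts = pd.crosstab(in_window["patient_id"], in_window["code"])
    return counts.reindex(fpt["patient_id"], fill_value=0)


features = pd.concat(
    [
        latest_value(labs, fpt, window_days=30, name="lab_latest"),
        code_counts(diagnoses, fpt, window_days=3 * 365),
    ],
    axis=1,
)
```

In this sketch, a patient without a laboratory value in the window keeps a missing value, whereas a patient without a given diagnosis code receives a count of zero, mirroring the two treatments of missingness described above.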
6.4. Bioinformatics approaches for viral genomics
In Study I, in addition to host genomics, viral HIV RNA paired-end sequences were available
based on Illumina MiSeq sequencing covering two amplicons in the HXB2 genome
positioned 1485-5058 and 5967-9517. Preparation and quality control of this data has been
reported in a previous study5. While different methods exist for multiple alignment of viral
sequences79, we opted for a simpler approach to generate putative epitopes from HIV
sequences. To do this, the raw reads were fragmented using KAT80 into pieces 27 bases long, which
were then translated into peptides 9 amino acids long. This length corresponds to the average length
of HLA class I epitopes81. The resulting peptides were mapped to 10 major HIV proteins (Asp,
Gag-Pol, Nef, Vpr, Vpu, gp160, Vif, Pr55, Rev, Tat) using BLAST by considering exact
matches to the reference proteins extracted from NCBI with an E-value < 1E-05.
Alternatively, to generate a random peptidome, sequences of 9 amino acids were collected by
processing half a million protein sequences from UniProt.
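The fragmentation and translation step can be illustrated in plain Python. This sketch does not call KAT or BLAST. The function names and the toy read are hypothetical, only the forward strand and the first frame of each 27-base fragment are translated, and the standard genetic code is assumed.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def kmers(read, k=27):
    """All overlapping substrings of length k from a nucleotide read."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def translate(fragment):
    """Translate a fragment codon by codon (forward strand, frame 0); 'X' marks unknown codons."""
    codons = (fragment[i:i + 3] for i in range(0, len(fragment) - 2, 3))
    return "".join(CODON_TABLE.get(c, "X") for c in codons)


def putative_peptides(read, k=27):
    """Unique peptides from every k-mer, dropping those with stop or unknown codons."""
    peptides = {translate(f) for f in kmers(read.upper(), k)}
    return {p for p in peptides if "*" not in p and "X" not in p}


# Hypothetical 30-base read: it contains four 27-mers, hence at most four peptides.
read = "ATGGGTGCGAGAGCGTCAGTATTAAGCGGG"
peptides = putative_peptides(read)
```

Fragments whose translation contains a stop codon are discarded here as a simplifying choice; in the approach described above, implausible peptides would instead be filtered by the exact-match mapping to the reference proteins.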
6.5. Supervised Machine Learning
6.5.1. Human Leucocyte Antigen imputation
In Study I, genotypic data was used to impute classic HLA alleles given the relevance of
these loci in GWAS studies3. While multiple methods for HLA imputation have been
proposed, Machine Learning approaches have proved to be efficient when imputing multi-
ethnic populations82. The method of choice, HIBAG83, is based on a bagging approach where
an ensemble of classifiers is trained on SNP genotypes for which the HLA haplotypes are
known by bootstrapping samples and averaging the posterior probabilities of the predicted
HLA haplotypes. Pre-trained models on SNPs present in Affymetrix UK Biobank Axiom Arrays
using multi-ancestry data from multiple GlaxoSmithKline clinical trials and HapMap phase 2
were available. The model predicted the posterior probabilities of classic HLA class I (HLA-
A, HLA-B, and HLA-C) and class II (HLA-DP, HLA-DQ, and HLA-DR) alleles at 4-digit
resolution given the genotypes. The posterior probabilities measure the uncertainty of the
resulting predictions. Only HLA alleles with posterior probabilities above 0.5 were called, setting the rest to
missing values.
6.5.2. Binding affinity prediction of HLA class I alleles to HIV peptides
For assessing functional similarities between HLA class I alleles we predicted their binding
affinities to peptides in the context of Study I. To do so we used NetMHCpan 4.081, a method
based on neural networks trained on extensive sets of peptides where the binding affinities
to HLA alleles have been experimentally assessed. This model has been reported as one of
the best-performing methods in the latest benchmarks84 with an accuracy close to the
experimental assessment of binding affinities. We predicted binding affinities of all 268 HLA
class I alleles available in the model to the HIV and random peptides previously
described in 6.4. From the predicted immunopeptidomes, informative subsets were selected
based on (i) peptides from the top 10% binding affinities, (ii) peptides within the top 10% of
the variability in binding affinities across HLA class I alleles and (iii) binders to at least 10%
of the alleles, with binders defined by a binding affinity <500 nM85.
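The three selection criteria can be sketched with NumPy on a matrix of predicted affinities. The text does not specify how affinities were summarised per peptide, so the choices below are assumptions made for illustration only: the mean across alleles for criterion (i), the standard deviation across alleles for criterion (ii), and lower nM values meaning stronger binding. The matrix itself is simulated and the function name is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical predicted affinities in nM: rows are HLA alleles, columns are peptides.
# Lower values mean stronger predicted binding.
affinity_nm = rng.lognormal(mean=8.0, sigma=1.5, size=(20, 200))


def select_subsets(affinity_nm, top_fraction=0.10, binder_nm=500.0, allele_fraction=0.10):
    """Boolean masks over peptides for the three criteria (assumed summaries)."""
    # (i) Peptides whose mean affinity is among the strongest 10% (assumption: mean).
    mean_aff = affinity_nm.mean(axis=0)
    strongest = mean_aff <= np.quantile(mean_aff, top_fraction)
    # (ii) Peptides whose affinity varies most across alleles (assumption: std).
    spread = affinity_nm.std(axis=0)
    most_variable = spread >= np.quantile(spread, 1.0 - top_fraction)
    # (iii) Peptides binding (< 500 nM) to at least 10% of the alleles.
    binder_share = (affinity_nm < binder_nm).mean(axis=0)
    shared_binders = binder_share >= allele_fraction
    return strongest, most_variable, shared_binders


strongest, most_variable, shared_binders = select_subsets(affinity_nm)
```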
6.5.3. Survival analysis by discrete-time modelling
It is common in clinical research that individuals leave a study before its end or follow-up of
the outcome of interest is discontinued. This is known as right-censoring and contrasts with
left-censoring, which occurs when an event of interest has happened before the start of a
study but the time at which the event happened is unknown. A plethora of methods has been
proposed for inference of the time to an event or, as it is more commonly called in epidemiology, survival
analysis. The Kaplan-Meier estimator and the Cox proportional hazards model are examples
of such methods that can account for censoring. In the context of prediction with ML models,
a common approach is to remove the censored observations and perform binary
classification which has been reported to generate biased models40,57. Different ML
approaches have been suggested to perform survival analysis26 but due to the usage of
strong assumptions and complex loss functions the interpretability of these models is
challenging based on existing tools. The most common assumption is that survival time is
continuous and the events of interest can happen at any time point. However, when
measured, time is always discrete even if conceptualised as continuous. In contrast,
modelling approaches considering time as discrete present numerous advantages, primarily
they can provide a closer representation of the data based on how it was observed.
In Study II, we implemented discrete-time modelling to predict 12-week mortality in SARS-
CoV-2 positive patients after observing right-censoring in patients tested after the 8th of
December 2020 (12 weeks before the data extraction) for whom follow-up was not
completed. This approach, originally described by Cox as an approximation to his
proportional hazards assumption for continuous-time modelling86, allowed us, by discretizing
time into predetermined intervals, to train binary classifiers at each time interval87. This was
done by augmenting the data longitudinally, repeating each feature vector for an individual
as many times as the number of time intervals in which the individual was observed (Figure 2). A variable
representing the time intervals was added as an input feature. The target value contains
zeroes up to the row of the last time interval in which an individual was observed. In that last
row, it is encoded as 1 if the patient died and as 0 if the individual was alive. When using the model for
new predictions, every individual was longitudinally augmented up to the maximum time of
prediction (12 weeks in our case) and accordingly indicated by the time feature.
Figure 2. Example of data transformations for discrete-time modelling.
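The longitudinal augmentation can be sketched with pandas as below. This is an illustrative sketch only: the cohort, the column names (`weeks_observed`, `died`, `interval`, `target`) and both helper functions are hypothetical, and weekly intervals numbered from 1 are assumed.

```python
import pandas as pd

# Hypothetical cohort: `weeks_observed` is the number of weekly intervals each
# person was followed; `died` marks whether follow-up ended in death.
cohort = pd.DataFrame({
    "patient_id": [1, 2, 3],
    "age": [71, 54, 80],
    "weeks_observed": [3, 12, 5],
    "died": [1, 0, 0],  # patient 3 is right-censored after week 5
})


def to_person_period(df, max_intervals=12):
    """Repeat each feature vector once per observed interval and build the target."""
    n_rows = df["weeks_observed"].clip(upper=max_intervals)
    long = df.loc[df.index.repeat(n_rows)].reset_index(drop=True)
    long["interval"] = long.groupby("patient_id").cumcount() + 1
    # The target is 1 only in the last observed interval of a patient who died.
    is_last = long["interval"] == long["weeks_observed"]
    long["target"] = (is_last & (long["died"] == 1)).astype(int)
    return long.drop(columns=["weeks_observed", "died"])


def expand_for_prediction(features, max_intervals=12):
    """At prediction time, every individual gets one row per interval up to the horizon."""
    long = features.loc[features.index.repeat(max_intervals)].reset_index(drop=True)
    long["interval"] = long.groupby("patient_id").cumcount() + 1
    return long


train_long = to_person_period(cohort)
predict_long = expand_for_prediction(cohort[["patient_id", "age"]])
```

A binary classifier trained on `train_long` with `interval` as an input feature then outputs, for each row of `predict_long`, the probability of death in that interval given survival up to it. Censored individuals contribute rows only for the intervals in which they were actually observed, so they are neither discarded nor treated as survivors of the full 12 weeks.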
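The predicted probabilities of death at each time interval constitute the hazard function
$h(t \mid x)$, which can also be expressed as a survival function $S(t \mid x)$ and a cumulative
distribution function $F(t \mid x)$ as defined below:
$$h(t \mid x) = P(T = t \mid T \geq t, x) \tag{1}$$
$$S(t \mid x) = P(T > t \mid x) = \prod_{k=1}^{t} \bigl(1 - h(k \mid x)\bigr) \tag{2}$$
$$F(t \mid x) = P(T \leq t \mid x) = 1 - S(t \mid x) \tag{3}$$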
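As a small worked example, the conversion from interval-wise hazards to survival and cumulative probabilities can be written with NumPy. It assumes the usual discrete-time convention in which the survival probability at interval t is the product of one minus the hazard over all intervals up to t. The function names and the hazard values are hypothetical.

```python
import numpy as np


def survival_from_hazard(hazard):
    """Survival S(t) as the running product of (1 - h(k)) along the time axis."""
    hazard = np.asarray(hazard, dtype=float)
    return np.cumprod(1.0 - hazard, axis=-1)


def cumulative_death_probability(hazard):
    """Cumulative probability of death F(t) = 1 - S(t)."""
    return 1.0 - survival_from_hazard(hazard)


# Hypothetical predicted hazards for one patient over four weekly intervals.
h = np.array([0.10, 0.20, 0.0, 0.50])
S = survival_from_hazard(h)          # 0.9, 0.72, 0.72, 0.36
F = cumulative_death_probability(h)  # 0.1, 0.28, 0.28, 0.64
```

Because the product runs along the last axis, the same functions apply unchanged to a matrix with one row per patient and one column per time interval.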
6.6. Unsupervised Machine Learning
Although a target value is not used in unsupervised learning, metrics to measure similarity
between observations are needed. For Study I, the dissimilarity between HLA alleles,
represented as different sets of predicted binding affinities (see 6.5.2), was measured as
cosine, Pearson correlation and Euclidean distances between these feature vectors. The
resulting distance matrices were then processed using hierarchical clustering to generate
dendrograms of functionally related HLA alleles. To do so, two distinct types of linkage
functions were used to agglomerate observations hierarchically. On the one hand, average
linkage iteratively joins the clusters with the shortest average distance between each other. On
the other hand, Ward linkage merges clusters by minimizing the error sum-of-squares
(ESS). While sometimes overlooked, the Ward linkage requires distances in Euclidean space,
hence the cosine and correlation matrices were corrected by taking their square root88.
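The combination of distance metrics and linkage functions can be sketched with SciPy. The feature vectors below are simulated stand-ins for predicted binding affinities, the helper `cluster_alleles` is hypothetical, and the square-root transform is applied only where Ward linkage meets a cosine or correlation distance, as described above.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
# Hypothetical feature vectors: 12 alleles x 50 peptides, in two well-separated groups.
profiles = np.vstack([
    rng.normal(0.0, 1.0, size=(6, 50)) + np.linspace(0, 5, 50),
    rng.normal(0.0, 1.0, size=(6, 50)) + np.linspace(5, 0, 50),
])


def cluster_alleles(profiles, metric, method, n_clusters):
    """Hierarchical clustering of rows; returns the linkage matrix and flat labels."""
    dist = pdist(profiles, metric=metric)  # condensed pairwise distance vector
    if method == "ward" and metric in ("cosine", "correlation"):
        # Square-root transform so that the dissimilarities behave as Euclidean distances.
        dist = np.sqrt(np.clip(dist, 0.0, None))
    tree = linkage(dist, method=method)
    labels = fcluster(tree, t=n_clusters, criterion="maxclust")
    return tree, labels


results = {
    (metric, method): cluster_alleles(profiles, metric, method, n_clusters=2)[1]
    for metric in ("cosine", "correlation", "euclidean")
    for method in ("average", "ward")
}
```

Each linkage matrix returned by `cluster_alleles` can be passed to `scipy.cluster.hierarchy.dendrogram` to draw the corresponding tree.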
6.6.1. Consensus clustering
When using ML methods, different algorithms, parameters and hyperparameters can yield
diverse valid solutions. To overcome this, ensemble learning has become a popular
approach in supervised machine learning in the form of bagging (e.g., in 6.5.1), boosting or
stacking. However, similar principles can be used with unsupervised learning to avoid bias
and generate better biological representations89. In the context of Study I, we implemented
consensus clustering90 to generate robust clusters based on the different subsets of
immunopeptidomes, distance metrics and linkage functions. A consensus matrix (Cij) of size
(n x n) was generated where each element contained the number of times the i-th allele
clustered together with the j-th allele when an increasing number of clusters was selected (3
to 160). After transforming the values into dissimilarity scores (1 − Cij), the consensus matrix
was then processed by hierarchical clustering with average linkage.
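A compact sketch of this consensus procedure with SciPy is given below. It assumes that the co-clustering counts are normalised by the number of clustering runs, so that 1 − Cij lies between 0 and 1. The simulated data, the reduced list of settings and the small range of cluster numbers are hypothetical and chosen only to keep the example short.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist, squareform


def consensus_matrix(profiles, settings, cluster_range):
    """Fraction of clustering runs in which each pair of rows shares a cluster."""
    n = profiles.shape[0]
    together = np.zeros((n, n))
    runs = 0
    for metric, method in settings:
        tree = linkage(pdist(profiles, metric=metric), method=method)
        for k in cluster_range:
            labels = fcluster(tree, t=k, criterion="maxclust")
            together += labels[:, None] == labels[None, :]
            runs += 1
    return together / runs


def consensus_clusters(consensus, n_clusters):
    """Average-linkage clustering of the dissimilarities 1 - C_ij."""
    dissimilarity = 1.0 - consensus
    np.fill_diagonal(dissimilarity, 0.0)
    tree = linkage(squareform(dissimilarity, checks=False), method="average")
    return fcluster(tree, t=n_clusters, criterion="maxclust")


rng = np.random.default_rng(0)
# Hypothetical feature vectors: 12 alleles x 50 peptides, in two groups.
profiles = np.vstack([
    rng.normal(size=(6, 50)) + np.linspace(0, 5, 50),
    rng.normal(size=(6, 50)) + np.linspace(5, 0, 50),
])
settings = [("euclidean", "average"), ("euclidean", "ward"), ("cosine", "average")]
consensus = consensus_matrix(profiles, settings, cluster_range=range(2, 7))
labels = consensus_clusters(consensus, n_clusters=2)
```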
6.7. Training and assessment of Machine Learning models
Estimating the parameters and assessing the performance of predictive models such as ML
models for supervised learning differs from that of the data models described in 4.2. A strict assessment
of the hyperparameter optimisation and predictions in data unseen by the model is required.
Moreover, avoiding data leakage is critical to prevent ML algorithms from learning
information that would not otherwise be accessible when generating new predictions. Such leakage
would restrict the generalisation capabilities of the model and potentially lead to overfitting.
In Study II, we used cross-validation in two steps to split the data into different sets with the
same mortality and time distributions. First, 60% of the data (training set) was divided into 5
folds and used for training the parameters of gradient boosting decision tree models
(LightGBM91). Because of the class imbalance of the dataset, deaths were assigned more
weight in the algorithm through a positive class weight of 100. With the trained parameters,
feature selection was performed by assessing 20% of the data (validation set). Second, the
training and validation sets were combined, split into 5 folds and used to train the final
ensemble of 5 models based on the parameters and features optimized in the previous step.
The predictions of the ensemble were combined using the mean of predicted probabilities
and a threshold of 0.5 was used to generate binary classes. The remaining 20% of the data
(test set) was used for the final performance assessment. Binary metrics such as sensitivity,
specificity, the precision-recall area under the curve (PR-AUC) and Mathew correlation
coefficient (MCC) were computed for each predicted week by excluding censored
individuals in the calculations. These last two metrics are recommended when class
imbalance is present where metrics such as accuracy or the area under the receiver
operating characteristic (AUROC) can be biased92. The survival metric used was the
weighted concordance index (C-index) based on the inverse probability of censoring weights
computed across all 12 weeks93. 95% confidence intervals for the performance metrics were
generated by bootstrap resampling of the generated predictions.
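The last steps of this assessment (averaging the ensemble, thresholding at 0.5, computing MCC and bootstrapping the predictions) can be illustrated with a minimal standard-library sketch. It is not the code used in Study II: the toy labels and probabilities, the function names, the percentile interval and the choice of treating a probability equal to 0.5 as positive are all assumptions made for illustration.

```python
import math
import random


def mean_ensemble(prob_lists):
    """Average the predicted probabilities of several models, sample by sample."""
    n_models = len(prob_lists)
    return [sum(sample) / n_models for sample in zip(*prob_lists)]


def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary labels coded as 0/1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


def bootstrap_ci(y_true, y_pred, metric, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval from resampled (label, prediction) pairs."""
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(metric([y_true[i] for i in idx], [y_pred[i] for i in idx]))
    stats.sort()
    lower = stats[int(alpha / 2 * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Toy example (hypothetical values): two models, ten non-censored individuals.
y_true = [1, 0, 0, 1, 0, 0, 1, 0, 0, 0]
model_probs = [
    [0.9, 0.2, 0.1, 0.6, 0.3, 0.1, 0.4, 0.2, 0.7, 0.1],
    [0.8, 0.1, 0.2, 0.7, 0.2, 0.2, 0.3, 0.1, 0.6, 0.2],
]
ensemble_probs = mean_ensemble(model_probs)
y_pred = [1 if p >= 0.5 else 0 for p in ensemble_probs]  # threshold of 0.5
point_estimate = mcc(y_true, y_pred)
ci_low, ci_high = bootstrap_ci(y_true, y_pred, mcc)
```

Resampling the stored predictions, rather than retraining the models, keeps the bootstrap inexpensive; the resulting interval reflects variability in the test set only and not in the training procedure.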
6.8. Explainability of Machine Learning models
As introduced in 4.2, inference has been traditionally used to arrive at biological and clinical
insights. Powerful predictive approaches through ML can also be leveraged for scientific
insights when coupled with strong domain knowledge and tools to interpret ML models. In
Study I, we show based on previous knowledge of HIV host genomics and biological
mechanisms of HLA alleles that ML approaches can be used to augment the existing data
and incorporate functional information. Although the methods used for HLA imputation and
prediction of binding affinities to peptides are not transparent, since they are based on
bagging techniques and artificial neural networks, their interpretability would only be useful
to understand mechanisms at the SNP and amino acid levels. The accurate predictions of
such models combined with statistical analyses were sufficient to generate meaningful
explanations.
In Study II, methods for model interpretability were required to explain the resulting ML
models based on current clinical knowledge. From the different approaches proposed for
model interpretability94 we opted to use SHapley Additive exPlanations (SHAP). Based on
cooperative game theory, SHAP values provide local explanations with theoretical
guarantees for their local accuracy and consistency95. The explanations provided account
for the contribution of each feature to the predicted value in each individual prediction. To do
so, calculating SHAP values is computationally expensive since accounting for multiple
combinations of features and values is needed. We employed TreeSHAP95, an adaptation of
SHAP for tree-based ML methods to overcome some of these previous limitations by
computing SHAP values in polynomial time and accounting for dependency between
features. Apart from providing the contribution of each feature to the hazard function h(t|x)
described in 6.5.3, SHAP values were used for feature selection by calculating the
mean(|SHAP|) for all features and removing those with a value lower than a pre-specified
threshold (0.01 in our case).
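The feature selection rule based on mean(|SHAP|) can be sketched as follows. This is a simplified illustration and not the thesis code: the SHAP matrix, the feature names and the function name are hypothetical, and in practice the matrix would come from a SHAP explainer applied to the validation set.

```python
def select_features(shap_matrix, feature_names, threshold=0.01):
    """Keep features whose mean absolute SHAP value reaches the threshold.

    shap_matrix holds one row per prediction and one column per feature.
    """
    n_rows = len(shap_matrix)
    importance = {
        name: sum(abs(row[j]) for row in shap_matrix) / n_rows
        for j, name in enumerate(feature_names)
    }
    kept = [name for name in feature_names if importance[name] >= threshold]
    return kept, importance


# Hypothetical SHAP values for three predictions and three features.
shap_values = [
    [0.30, -0.002, 0.05],
    [-0.10, 0.004, -0.03],
    [0.20, -0.006, 0.01],
]
names = ["age", "noise_feature", "bmi"]
kept, importance = select_features(shap_values, names)
```

Taking the absolute value before averaging matters: a feature that pushes some predictions up and others down would otherwise appear unimportant.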
6.9. Statistical analyses
Descriptive statistics were employed to summarise the cohorts in both studies and to
generate features or aggregate predictions. While this thesis focused on ML methods,
statistical modelling was employed in Study I to test associations of HLA alleles and functional
HLA clusters with VL. Each node in the resulting dendrogram from consensus clustering
described in 6.6.1 was tested for association with log10-transformed HIV-1 viral load using linear
regression adjusted for sex, self-reported race, and country. Due to the number
of tests performed, multiple testing correction19 by the Benjamini-Hochberg procedure was
applied to control the false discovery rate. Associations were considered significant when an
adjusted p-value (q-value) < 0.05 was observed.
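For reference, the Benjamini-Hochberg adjustment can be written in a few lines. The sketch below is a generic implementation of the standard procedure with made-up p-values; the analyses in Study I were run in R, so this is only an illustration of the computation.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values (q-values) in input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that the adjusted
    # values stay monotone in the rank of the raw p-values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        q_values[i] = running_min
    return q_values


# Hypothetical p-values from four association tests.
p = [0.001, 0.04, 0.03, 0.20]
q = benjamini_hochberg(p)
significant = [qi < 0.05 for qi in q]
```

In this toy example only the first test remains below the 0.05 cut-off after adjustment, although three raw p-values were below 0.05.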
6.10. Software and visualization tools
Open-source tools have been prioritised for conducting the studies outlined in this thesis to
facilitate access and reproducibility of the results. Data wrangling and statistical analyses
were primarily performed using R96 supported by the tidyverse library97. Machine learning
methods have been implemented in Python supported by the pandas98 and numpy99 libraries
using gradient boosting decision trees implemented in LightGBM91. Model performance was
assessed using the libraries scikit-learn100 and scikit-survival101. Summary tables were
generated using tableone102.
Regarding visualization and dissemination of the results, open-access and interactive results
were prioritized. In Study I, dendrograms and association coefficients were depicted using
Interactive Tree of Life (iTOL)103. Tanglegrams to compare diverse clustering approaches
were generated by the dendextend R package104. The results are presented through a web
application using shiny105 accessible at bit.ly/HLA_dendogram. Likewise, in Study II, the
resulting ML model and code are available in a public repository at bit.ly/COVIMUN_DT.
7. Summary of results
The main results of Study I and II are outlined below. For a full description of the methods
and results for each study please find the full manuscripts attached in the Manuscripts
section.
7.1. Study I
Associations of functional HLA class I groups with HIV viral load in a
heterogeneous cohort
Due to the challenges of studying important genotypic variants in multi-ancestry populations,
we assessed if accounting for functional similarities in HLA class I alleles could facilitate
association testing of HLA with HIV-1 viral load (VL) in a heterogeneous cohort. To do so we
employed supervised and unsupervised Machine Learning techniques for HLA imputation,
binding affinity prediction to peptides and consensus clustering of HLA alleles.
HLA alleles were imputed from genotypes of 2546 HIV+ participants. The reported accuracy
of the imputation in out-of-bag samples was >90% for all HLA loci. In a subset of 1122
participants tested for HLA-B*57:01 only 2 false positives and 2 false negatives were
observed based on our imputation. For HLA class I the calling rates were 95.5% for HLA-A,
82.8% for HLA-B and 96.4% for HLA-C as described in detail in a previous study3.
We predicted the binding affinities of 268 HLA class I alleles to 173,792 peptides derived
from 2079 HIV samples and half-million random peptides. We applied consensus clustering
to the generated immunopeptidomes to group HLA alleles based on functional similarities.
The resulting HLA clusters were used to test associations with HIV-1 viral load by linear
regression adjusted for sex, self-reported race, and country (Figure 3).
Figure 3. Graphical abstract of Study I.
We found four HLA clusters associated with HIV-VL composed of 30 HLA class I alleles of
which 11 were observed in participants from the START cohort. On one hand, two of these
nodes were associated with a lower VL: one cluster composed of HLA-B*57:01, B*58:01,
B*57:02, and B*57:03 (β -0.25, q-value 7.02E-06) and a second cluster composed of
HLA-C*08:04 and C*08:01 (β -0.29, q-value 0.042). On the other hand, two nodes were
associated with higher VL: one cluster composed of six HLA-B*44 alleles, B*44:05, B*44:08,
B*44:04, B*44:03, B*44:02, B*44:27 (β 0.15, q-value 0.003) and a cluster composed of 16
alleles: B*35:20, B*35:16, B*35:10, B*35:43, B*35:08, B*35:19, B*35:41, B*35:01,
B*35:17, B*35:05, B*44:06, B*56:03, B*53:01, B*15:08 and B*15:11 (β 0.13, q-value 0.048)
(Figure 4). Only two HLA class I alleles from the reported clusters (HLA-B*57:01 and
B*57:03) were associated with HIV-VL when tested independently.
Figure 4. Dendrogram of HLA class I alleles clustered based on binding affinities to HIV
peptides. Associations, defined by an adjusted p-value < 0.05, are represented as thick
branches for nodes and black triangles for leaves. White triangles indicate HLA alleles
imputed in our cohort. The effect of the respective associations is colour-coded from
protective effect (blue) to detrimental (red). Green bars on the outer ring reflect HLA allele
counts. An interactive version is available at bit.ly/HLA_dendogram
While HIV-specific immunopeptidomes were used for the reported associations they only
reflected a small statistical power increase compared to HLA clusters based on random
immunopeptidomes. Overall, the HLA clusters accounted for 11.44% of the explained
variance in HIV VL after adjustment for sex, self-reported race, and country,
similar to estimates reported in homogeneous cohorts of European ancestry.
7.2. Study II
Personalized survival probabilities for SARS-CoV-2 positive patients by
explainable machine learning
In the context of the COVID-19 pandemic, we developed ML models to predict mortality
within 12 weeks of a first positive SARS-CoV-2 test (FPT). We considered 33,938 patients
who had at least one SARS-CoV-2 RT-PCR positive test performed between the 17th of March
2020 and the 2nd of March 2021. 1,803 (5.34%) deaths were observed of which 141
happened after 12 weeks from FPT.
We generated 2723 features from electronic health records encoding information about
demographics, diagnoses, medications, laboratory test results and vital parameters. After
feature selection, only 22 features were sufficient for the final model. We performed discrete-
time modelling using gradient boosting decision trees (LightGBM91) to account for censoring
of patients whose FPT was taken after the 8th of December 2020 and who therefore had an
incomplete 12-week follow-up. By modelling time explicitly in an ML framework, we
could predict patient-specific survival probabilities that could be explained in terms of 22
clinical features and time since FPT (Figure 5).
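The data layout behind discrete-time modelling can be illustrated with a short sketch. Each patient is expanded into one row per week at risk, so that a binary classifier can learn the weekly hazard while censored patients contribute only the weeks in which they were observed. The function name, the field names and the toy patients are hypothetical, and the actual preprocessing in Study II may differ.

```python
def to_person_period(patient_id, weeks_observed, died, features, horizon=12):
    """Expand one patient into one row per week at risk.

    weeks_observed: last week (1-based) in which the patient was followed.
    died: True if death occurred in week `weeks_observed`; False if the
          patient was censored or alive at the end of follow-up.
    """
    rows = []
    last_week = min(weeks_observed, horizon)
    for week in range(1, last_week + 1):
        # The event label is 1 only in the week of death within the horizon.
        event = 1 if (died and week == weeks_observed) else 0
        rows.append({"id": patient_id, "week": week, "event": event, **features})
    return rows


rows = []
rows += to_person_period("A", 3, True, {"age": 81})    # died in week 3
rows += to_person_period("B", 5, False, {"age": 45})   # censored after week 5
rows += to_person_period("C", 12, False, {"age": 60})  # complete follow-up
late_death = to_person_period("D", 14, True, {"age": 70})  # death after week 12
```

In this layout the week number becomes an ordinary feature, which is what later allows the contribution of time itself to be explained. A death occurring after the 12-week horizon produces twelve rows without an event.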
Figure 5. Graphical abstract of Study II. Panel a. depicts the period and sources represented
in the dataset. Panel b. displays the feature engineering process of values up to a first positive
test. Panel c. illustrates the Machine Learning approach for survival analysis.
The performance of our model for predicting the risk of death for all 12 weeks measured in
20% of the data (test set) showed a weighted C-index with 95% confidence intervals (CI) of
0.946 (0.941-0.950). At week 12, PR-AUC and MCC with 95% CI were 0.686 (0.651-0.720)
and 0.580 (0.562-0.597) respectively. This performance was reflected in a high
discriminatory power when looking at discrete and cumulative probabilities of death for each
individual aggregated as the median (Figure 6 a, b) or as individual examples (Figure 6 c, d).
Figure 6. Predicted individual discrete and cumulative death probabilities.
When SHAP values were computed for feature selection and model interpretability, high
values of features such as age, BMI, sex (male), and clinical factors such as the number of
unique prescribed medications and diagnosis codes manifested as the top risk factors
(Figure 7a). By including time as a variable following the discrete-time modelling approach
we could explore the temporal dynamics of individual risk factors. Age, ordered loop
diuretics, and admission at the time of FPT had a higher impact on the risk of dying early,
while BMI, diagnosis of Alzheimer’s disease, and ordered B-vitamin were relevant for late risk
(Figure 7b). The proposed ML approach not only captured relevant risk factors that differed
over time but also across individuals (Figure 7c-d).
The model could capture non-linear contributions of the features to the risk of death at 12
weeks. Partial dependence plots (PDP) revealed that age contributes to the risk of death
above 60 years of age (Figure 8a) and showed a higher risk of mortality in patients with BMI lower than
30 (Figure 8b). A higher risk of death was also seen for patients with low lymphocyte count
(Figure 8d). Finally, we found that the number of ordered medicines was a better predictor
of death than the number of diagnoses: patients with fewer than five ordered
medications in the last year showed up to 10% lower risk of death, whereas patients with
more than 20 ordered medications showed up to 40% higher risk (Figure 8h).
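The computation behind a partial dependence curve is simple enough to sketch. For each grid value of the feature of interest, that value is imposed on every individual and the predictions are averaged. The risk function below is a made-up stand-in for a trained model, and its coefficients are arbitrary; only the averaging logic is the point of the example.

```python
def partial_dependence(predict, data, feature, grid):
    """Average prediction when `feature` is set to each grid value in turn."""
    curve = []
    for value in grid:
        preds = [predict({**row, feature: value}) for row in data]
        curve.append(sum(preds) / len(preds))
    return curve


def toy_risk(row):
    """Hypothetical non-linear risk function standing in for a trained model."""
    age_term = 0.02 * max(0, row["age"] - 60)
    bmi_term = 0.05 if row["bmi"] < 30 else 0.0
    return 0.01 + age_term + bmi_term


data = [
    {"age": 40, "bmi": 24},
    {"age": 70, "bmi": 32},
    {"age": 85, "bmi": 27},
    {"age": 55, "bmi": 35},
]
age_curve = partial_dependence(toy_risk, data, "age", [50, 60, 70, 80])
```

Because the other features keep their observed values, the curve is flat where the model ignores the feature and rises where it does not, which is how a threshold effect becomes visible.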
Figure 7. Global and local explanations of predictive features of 12-week mortality in SARS-
CoV-2 positive patients. The top row depicts agglomerated SHAP values representing global
explanations (a) and weekly global explanations (b). The bottom row illustrates predicted
discrete probabilities of death and the contribution of the features for two patients as an
example of personalized survival probabilities with their corresponding risk factors (c-d).
Figure 8. Partial dependence plots of relevant features in SARS-CoV-2 12-week mortality
prediction by survival status.
8. Discussion
The two studies presented in this thesis aim to tackle important challenges in precision
medicine of patients with infectious diseases. The solutions based on Machine Learning
approaches and the findings revealed by explaining the resulting models raise important
methodological, biological and clinical considerations.
8.1. Imputation and functional clustering of HLA alleles by Machine Learning
models
The central role of HLA molecules in the adaptive immune response has been recognised
not only in vitro but also in vivo. The study of HLA alleles in GWAS has been challenging due
to the extreme polymorphism and strong linkage disequilibrium in the HLA loci77. Big sample
sizes are required and difficulties arise when studying populations of diverse ancestry, with
the added problem of the cost of characterising HLA alleles in big cohorts. To facilitate this,
in Study I we imputed HLA alleles for 2,546 HIV+ participants based on SNP array data. The
imputed HLA alleles have proved valuable not only for this study but for the other relevant
scientific contributions listed in this thesis exploring the individual associations of these alleles
with HIV VL3, with clinical outcomes4 or host-viral genomic interactions5. Likewise, multiple
computational approaches have been proposed to investigate HLA alleles in the context of
HIV-1 infection: genome-to-genome analyses106, exploring new epitopes107 or assessing the
interactions of HLA class I molecules to other ligands such as killer immunoglobulin-like
receptors (KIRs)108.
We proposed clustering HLA alleles based on shared function under the reasonable
assumption that the majority of HLA functionality is mediated by their capacity to bind and
present epitopes. This allowed us to account for the high genetic diversity of the cohort and
viral sequences by predicting the binding affinities of HLA class I alleles to HIV-specific and
unspecific immunopeptidomes. Viral diversity was accounted for using a fast k-mer approach
to process all HIV sequences and avoid generating a consensus sequence. This is important
for viruses with high genetic variability such as HIV that are shaped by the selective pressure
inside the host109. Predicted binding affinities to HIV and unspecific peptides were used for
clustering. While clustering based on a Bayesian framework has been presented in a
previous study110, the results were limited by the homogeneity of the data and assumptions
of the clustering parameters. We opted for consensus clustering using different subsets of
immunopeptidomes to avoid bias from the substantial number of non-binders predicted that
can drive the computed distances among HLA alleles towards zero. Using a filtering and
ensemble approach helped us to mitigate issues due to high dimensionality and has been
successful when applied to biological data89.
8.1.1. Associations of functional HLA class I clusters with HIV viral load
A key parameter and potential source of bias in unsupervised learning is the estimation of
the number of final clusters or a threshold to split different entities. The dissimilarity matrices
based on predicted immunopeptidomes from Study I were measured on a continuous scale
hence defining discrete clusters was a problem. To overcome this and test associations of
functional clusters with HIV VL we processed the dissimilarity matrices from consensus
clustering by hierarchical clustering allowing us to test associations of each node in the
dendrogram from the leaves (individual HLA class I alleles) up to the roots (HLA functional
nodes). This process combined support for nodes based on their association with VL when
the combination of alleles in all child nodes allowed it. Compared to traditional HLA
association studies challenged by the polymorphism of the locus and diverse ancestries, we
could, by aggregating functionally related alleles, increase the statistical signal and uncover
associations that could not be found at the individual allele level due to sample size111. Strict
multiple testing was applied to avoid false positives from the increasing number of statistical
tests.
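The idea of testing every node of the dendrogram can be sketched by enumerating the set of alleles under each node and deriving a node-level exposure for each participant. The sketch assumes a binary tree encoded as nested tuples and defines a participant as a carrier of a node if they carry any allele beneath it; both the encoding and this carrier definition are illustrative assumptions, not the exact procedure of Study I. The regression step itself is omitted.

```python
def leaves(node):
    """Return the set of allele names under a node (a leaf is a string)."""
    if isinstance(node, str):
        return {node}
    left, right = node
    return leaves(left) | leaves(right)


def all_nodes(node):
    """Yield the allele set of every node, children before their parent."""
    if not isinstance(node, str):
        for child in node:
            yield from all_nodes(child)
    yield frozenset(leaves(node))


def carrier_indicator(participant_alleles, node_alleles):
    """1 if a participant carries at least one allele of the node, else 0."""
    return [1 if set(alleles) & node_alleles else 0 for alleles in participant_alleles]


# Toy dendrogram and cohort (hypothetical).
tree = (("B*57:01", "B*57:03"), "B*44:02")
participants = [["B*57:01", "B*44:02"], ["B*57:03"], ["B*44:02"], ["B*07:02"]]

nodes = list(all_nodes(tree))
b57_node = frozenset({"B*57:01", "B*57:03"})
b57_carriers = carrier_indicator(participants, b57_node)
```

Each indicator vector would then serve as the exposure in one regression, so the number of tests equals the number of nodes, which is why multiple testing correction is needed.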
We found four functional HLA class I nodes associated with HIV-VL accounting for 30 HLA
alleles. Only two of these alleles could be detected when performing independent allele-
specific tests. Some have been previously reported when tested individually112 as for the case
of alleles HLA-B*57:01, B*58:01, B*57:02, and B*57:03. By reporting their common
functionality and combined protective effect we could analyze these alleles together despite
their different frequencies between populations. We also found a protective effect of a
functional node of two alleles, HLA-C*08:04 and C*08:01, confirming findings from other
studies on admixed populations113. While most of our results were in alignment with previous
reports, we found a node of six HLA-B*44 alleles associated with a higher VL that contradicts
a previous study in a Chinese cohort114 for which alleles present in the cluster showed a
protective effect. We hypothesize this could be due to a regional adaptation but the lack of
participants from that region did not allow us to explore this further. Nevertheless, our analysis
could explain 11.44% of the variance in HIV VL in a genetically diverse cohort, similar
to estimates reported in European74 and multi-ancestry cohorts78. Estimating the contribution of
genetic factors to outcomes of interest while sometimes overlooked is critical to assess the
impact of such variables in precision medicine.
8.2. Explainable Machine Learning for survival models in precision medicine
In the last decade, algorithmic modelling has become more prevalent for predictive tasks
such as prognosis or diagnosis in biomedical research22,37. This has been more evident
during the COVID-19 pandemic where at least 180 ML solutions have been proposed52.
Nevertheless, most of the models in the literature show a high risk of bias due to their design
or poor reporting55,56.
In Study II, we developed a prognostic ML model to predict the risk of death within the first
12 weeks from the first positive SARS-CoV-2 PCR test based on electronic health records
of 33,938 patients from Eastern Denmark. We proposed solutions to reported concerns
regarding prognostic models with a focus on model explainability. First, to model longitudinal
data, we implemented a discrete-time approach86,87 that has been shown to achieve
performance as good or better than continuous-time models115,116 such as regularized Cox
models or Random Survival Forests. By doing so, we could model all the individuals in our
cohort, even those for which the outcome was not observed due to a lack of complete follow-
up (12 weeks) at the time when the data extract was generated. This allowed us to avoid
selection bias and underestimation of predicted risks reported in other prognostic models by
not removing censored observations40,57. Because no proportionality of hazards was
assumed, our model could predict personalized survival probabilities117 for each patient using
binary classifiers at each time interval, leveraging existing algorithms such as gradient
boosting decision trees91. Furthermore, our model could handle missing values without
imputation and account for the class imbalance due to the low proportion of events in the
model by positive class weighting. Certain areas such as uncertainty estimation of individual
predictions or assessment of calibration could be further improved in our proposed modelling
approach.
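To make the link between interval-wise binary predictions and survival probabilities concrete, the sketch below turns a sequence of predicted discrete hazards into a survival curve through the standard relation S(t) = (1 - h_1)(1 - h_2)...(1 - h_t). The hazard values are invented for illustration.

```python
def survival_curve(hazards):
    """Survival probability after each interval, given discrete hazards."""
    curve = []
    surviving = 1.0
    for h in hazards:
        # Surviving interval t requires surviving all previous intervals too.
        surviving *= 1.0 - h
        curve.append(surviving)
    return curve


# Hypothetical weekly hazards predicted for one patient.
hazards = [0.10, 0.05, 0.02]
survival = survival_curve(hazards)
cumulative_death = [1.0 - s for s in survival]
```

Since the hazards are predicted separately for each patient and each week, no proportionality between patients is imposed on the resulting curves.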
To gain insights from the complexity learned by the models118 we used methods for model
interpretability in the context of explainable artificial intelligence (xAI). Most of these methods
operate by removing or altering the values of the features in the model94 to open the “black
box”. Some of these approaches have been used in clinical models for different diseases44,119
including COVID-19120. In some cases, the insights from model explainability revealed biases
and helped improve clinical models39. In Study II, we used SHapley Additive exPlanations
(SHAP)95 values to explain the contribution of the features included in the model to the
predicted survival probability given the specific context of the patient. In addition, since we
included time as a feature, we explain the temporal dynamics of such contributions which
have not been studied previously. We argue that these local explanations are necessary for
the implementation of machine learning models in precision medicine since they reveal
patient-specific risk factors. When combined, global explanations can be derived29,121
generalizing important risk factors involved in the prognosis, in this case, of SARS-CoV-2
positive patients. It is important to mention that the features explained and selected for their
predictive power do not necessarily imply causal effects21,52 of such features in mortality.
While informative, different sets of features could be equally predictive due to equally
performing solutions found during the training of ML models122.
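The meaning of a local explanation can be illustrated by computing exact Shapley values for a tiny model. The sketch enumerates all coalitions of features and replaces absent features by a baseline value, which is feasible only for a handful of features and is not the TreeSHAP algorithm used in Study II. The model, the feature names and the baseline are hypothetical.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values; features outside a coalition take baseline values."""
    names = list(x)
    n = len(names)

    def value(coalition):
        row = {k: (x[k] if k in coalition else baseline[k]) for k in names}
        return predict(row)

    phi = {}
    for name in names:
        others = [k for k in names if k != name]
        total = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                members = set(subset)
                total += weight * (value(members | {name}) - value(members))
        phi[name] = total
    return phi


def toy_model(row):
    """Hypothetical risk score with an interaction between two features."""
    return (0.1 * row["age_over_60"] + 0.05 * row["male"]
            + 0.2 * row["age_over_60"] * row["male"])


x = {"age_over_60": 1, "male": 1, "bmi_high": 0}
baseline = {"age_over_60": 0, "male": 0, "bmi_high": 0}
phi = shapley_values(toy_model, x, baseline)
```

The contributions sum to the difference between the prediction for the patient and the prediction for the baseline (local accuracy), and the interaction term is shared equally between the two features involved. Averaging absolute contributions over many patients is one way to move from such local explanations to a global ranking.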
8.2.1. Risk factors in SARS-CoV-2+ patients through model explainability
Electronic health records are a valuable resource that can be used to create prognostic
models. Since EHR are optimized for routine care compared to studies in which data is
collected following a protocol, leveraging such information requires careful processing123 of
the outcome to model and encoding of the features to include. In Study II, we chose to model
the outcome of death within the first 12 weeks from the positive SARS-CoV-2 PCR test. All-
cause mortality was the most reliable outcome available in our dataset. While other outcomes
might be more clinically relevant such as hospitalization, oxygen use or admission to an ICU,
the reported cause of these events by COVID-19 was uncertain. Moreover, this reporting
evolved during the pandemic incurring data shifts that even implied changes in national
policy in Denmark124. To provide a more informative outcome we explored sustained
recovery as presented in another study6 which could be used for further predictive models.
A basic approach was used to encode 2723 features from EHR of demographics, laboratory
test results, hospitalizations, vital parameters, diagnoses and medicines. Feature encoding
presents as a combinatorial problem for which multiple solutions have been proposed to
generate meaningful representations125,126. We encoded the latest values or counts in
clinically relevant time windows prior to FPT depending on the data type facilitating the
interpretation of the model. Nevertheless, more optimal solutions could be available by, for
example, including feature trajectories.
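A minimal sketch of this kind of encoding is shown below: for one variable, the latest value and the number of records within a look-back window ending at FPT are extracted, and records after FPT are ignored to avoid leakage. The window length, the dates, the laboratory values and the function name are hypothetical, and the time windows used in Study II depended on the data type.

```python
from datetime import date, timedelta


def encode_window(records, fpt, window_days):
    """Return (latest value, record count) within `window_days` before FPT.

    records: list of (date, value) tuples for one patient and one variable.
    A missing measurement is returned as None rather than imputed.
    """
    start = fpt - timedelta(days=window_days)
    in_window = sorted((d, v) for d, v in records if start <= d <= fpt)
    latest = in_window[-1][1] if in_window else None
    return latest, len(in_window)


fpt = date(2020, 11, 1)
lymphocytes = [
    (date(2020, 6, 1), 1.8),    # too old for a 30-day window
    (date(2020, 10, 20), 0.9),  # inside the window
    (date(2020, 11, 5), 0.7),   # after FPT, must not be used
]
latest, count = encode_window(lymphocytes, fpt, window_days=30)
no_data = encode_window([], fpt, window_days=30)
```

Keeping the missing value as None, instead of filling it in, matches a setting in which the downstream model handles missing values natively.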
From explaining our model, we found older age127, sex (male)128 and obesity129 as important
risk factors for mortality in SARS-CoV-2+ patients in alignment with previous studies. Age
was especially relevant in individuals over 60 years old due to age-related factors such as an
increased prevalence of comorbidities that our Machine Learning approach could capture
by modelling interactions. We observed that the number of medications prescribed or
administered was more informative than diagnosis codes suggesting a better proxy for
comorbidities130. Lymphocytopenia was also identified as a relevant risk factor that may
represent an immune dysfunction caused not only by COVID-19131 but also by ongoing therapy or
malignancies.
We reported the temporal dynamics and interactions of relevant features. We observed a
higher risk of death in the first four weeks since FPT, probably representing the period during
which the infection was still active47. We could also distinguish between risk factors for early
vs late mortality. Factors such as being hospitalized at the time of FPT, the week since the
start of the pandemic in which the prediction was made, age and administration of loop
diuretics were important factors for early death. Other factors such as lower BMI, diagnosis
of Alzheimer’s disease, and ordered B-vitamin explained the risk of late death (>8 weeks)
probably indicating frail patients with a profound disease progression hence not
recommended for ICU or mechanical ventilation. Some of these features indicate latent
variables not encoded in the features but impacting the outcome such as an improvement
of care due to an evolved understanding of the disease during the pandemic132. To encode
this trend, we included the week since the start of the pandemic as a feature. Patients
infected early rather than later in the pandemic had a higher risk of dying. This reflects the
need to not only focus on the data available in EHR but also to encode meaningful variables
informed by domain knowledge or the environment that could be highly predictive.
8.3. Strengths and limitations
The main strengths of this thesis reside in the methodology proposed and the large cohorts
of HIV+ and SARS-CoV-2+ individuals used for its assessment. The proposed ML
approaches provide solutions to key aspects in the precision medicine of patients with
infectious diseases while maximizing the performance and explainability of the resulting
models.
The functional analysis of HLA alleles in Study I allowed us to study one of the most
polymorphic regions in the human genome with a critical role in infections77, especially HIV133.
We demonstrate that the imputation of HLA alleles through ML is a cost-efficient and
accurate approach to infer HLA types in big and geographically diverse cohorts such as in
the START cohort. Coupled with accurate binding affinity predictions of HLAs to peptides
and consensus clustering to assess functional HLA groups, the presented approach is
neither specific to HIV nor viral load. By providing open access to our results, we facilitate
the study of HLAs and outcomes in other host-pathogen interactions based on already
computed HLA class I functional clusters.
In terms of the methods, some of the limitations of Study I can be found in the assumption of
the functionality of HLA alleles and the approach to infer relevant HLA alleles and functional
clusters. While most of the function of HLA molecules comes from presenting epitopes
through the binding of relevant peptides, they can also interact with other ligands that affect
their function such as KIRs which we have not accounted for108. Regarding the inference of
relevant HLA alleles or groups, we used linear models which do not account for non-linear
effects. This model choice is nevertheless the standard approach for testing associations. In
terms of the genetic data available, we lacked participants from Asian countries while also
missing extra participants from more African regions71.
We propose a complete framework for explainable prognostic models based on electronic
health records and Machine Learning. As presented in Study II, we could account for
non-linear effects, handle missing values, relax assumptions such as proportional hazards when
modelling right-censored individuals, learn interactions between variables and explain
temporal dynamics of risk factors while improving discriminative performance. To our
knowledge, this modelling approach accounts for many of the drawbacks of previous
models, especially in the context of the COVID-19 pandemic56. By leveraging EHR based on
a data-driven selection of 22 variables our publicly available model could be tested for
implementation enabling precision medicine in routine care.
Some aspects of our proposed modelling framework could be improved. Assessing the
calibration of the models and measuring the uncertainty of individual predictions was
attempted but not accomplished in the discrete-time framework. Arbitrary time intervals were
chosen that could be further optimized to perform the discretization of time. It is worth noting
that the proposed approach requires more effort in processing the data into the required
input for the discrete-time approach. The main limitations of Study II are related to the
available data. Despite representing real-world data (RWD) with big sample sizes, some
uncertainty in EHR only allowed us to trust certain outcomes and conditions for the model to
be reliable. Setting the prediction point at the first positive SARS-CoV-2 test limited the use of the
model at later time points. Also, the lack of confirmation of the cause of death in
SARS-CoV-2+ individuals only permitted us to model all-cause mortality instead of deaths caused by
COVID-19. Due to the fast evolution of the pandemic, data shifts were present134; for
example, our model did not account for vaccinated individuals, who now correspond to
most of the population in Denmark. These limitations can nevertheless be addressed by
retraining the models and validating them in external datasets to assess the generalizability
of predictions in other healthcare systems.
9. Conclusion
The work conducted in this thesis demonstrates the implementation of Machine Learning
methods for precision medicine of patients with infectious diseases. We presented in two
studies how these novel methods can be used to study challenging genetic regions in diverse
populations and to generate precise prognostic models. When supported by model
explainability, ML models not only provided accurate predictions but also insights into
complex mechanisms involved in HIV genetics and disease progression in SARS-CoV-2
infections.
In Study I, we imputed HLA alleles and predicted their binding affinities to HIV and unspecific
peptides. This allowed us to group functionally related HLA alleles that would be differentially
represented across populations hence facilitating association testing of HLA alleles with HIV
viral load in cohorts of diverse ancestry.
In Study II, we employed electronic health records to develop data-driven models of 12-week
mortality in SARS-CoV-2 positive patients. We show how discrete-time modelling can be
leveraged by existing approaches for binary classification. Modelling the time explicitly as a
variable allowed us to unravel the temporal dynamics of personalized risk factors for each
patient. This allowed us to not only train survival models that account for censored patients
with great performance but also accommodate existing techniques for model explainability.
While the ML approaches presented in this thesis have been applied in the context of HIV
and SARS-CoV-2 infections, the study of HLA alleles through their function and the
development of explainable survival models can be applied to other diseases. The
methodology presented not only allows for a deeper understanding of diseases by explaining
the resulting models but also addresses two of the major tenets in precision medicine: the
estimation of patient-specific risks and refined characterisations of individuals by accounting
for relevant genetic factors in clinical models.
10. Future perspectives
Given the stage of development in which precision medicine is still immersed, multiple
considerations can be taken regarding its future. One of the main concerns is the challenge
to translate the findings from the laboratory to clinical practice7,60,135, a cycle colloquially
denominated “from the bench to the bedside”. A lot of the roadblocks to precision medicine
can be found in the lack of logistical and educational progress in the field. To overcome these
issues, multiple national initiatives are focusing on building the necessary infrastructures11,59
to support the richer data from the high-throughput characterisation of patients and data
lakes to enable data-driven clinical models as in the case of PERSIMUNE136. Furthermore,
new educational programs on precision medicine are available for clinical practitioners and
researchers137. Nevertheless, important aspects of clinical research regarding
epistemological assumptions and methodological considerations9 have to be improved to
grant robust clinical models that could sustain their transition into the clinic. These aspects
relate to how and what to model in precision medicine.
10.1. How to model: from predictive to causal models
The increasing use of algorithmic models in statistical modelling and Machine Learning is not
only improving the predictive power of clinical models but also their acceptance into clinical
practice when aided by methods for model explainability. Prospective assessment of such
models and benchmarking against standard of care and medical practitioners are still
required to understand their clinical impact37. Some studies have reported
algorithmic models reaching the accuracy of medical experts138. However, when assessed
prospectively, models trained on retrospective data can experience distribution shifts and
lose performance over time134. An explanation for this can be found in the way that statistical
modelling and Machine Learning have been designed so far. Modelling based on correlations
or associations has left the assessment of causality to experiments and clinical trials. In
recent years, causal inference has been gaining popularity as an approach to quantifying the causal
effects of variables of interest. In the context of genomics, Mendelian randomization has been
proposed to measure the causal effects of genetic variation on different outcomes139. But
causal principles can also be applied in other areas. Principles of causal understanding such
as counterfactuals can be used in conjunction with Machine Learning algorithms. This ties in
closely with the field of model explainability, since most of our explanations are based on
counterfactual thinking29. When these principles are applied, they can provide more robust
and accurate clinical models140 that could withstand the process of deployment in clinical
practice.
10.2. What to model: data-driven models and complexity
Traditional clinical models have been developed by considering a restricted set of variables
of interest, selected based on previous domain knowledge. These variables were usually
chosen from available data in routine care or studies exploring the explanatory potential of
new variables. If new clinical variables improved the existing models, these would be then
measured and incorporated into routine care. With the digitalization of medical records and
the accessibility of high-throughput techniques, the amount of available data is growing
exponentially. Big data processed by algorithms such as those in Machine Learning enables a new,
data-driven way of modelling, where all available information can be used to predict an
outcome of interest. The selection of variables can then be informed by the algorithm, which
identifies the most informative ones; these can later be interpreted to yield new insights. This
approach of considering all available data when developing models can also inspire new
ways not only to account for all the already collected information but also to encode
complex variables that reflect environmental or psychosocial features. The need for
modelling complexity in medicine13 is supported by studies quantifying the impact of
genetic and environmental factors on diverse traits, in which an overestimation of heritability
in complex diseases has been reported141. In some cases, environmental or unknown
non-genetic factors can account for up to 70% of variability, as in the case of HIV set-point viral
load133. Exploring non-genetic variables of clinical relevance, such as environmental and
psychosocial factors, could not only improve the accuracy of clinical models but also expose
new candidates for interventions. By better understanding individuals and their environment,
precision medicine could live up to the etymological roots of health, referring to
“whole”, providing a complete and inclusive picture of the complexity of medicine.
11. References
1. Zucco, A. G. et al. Associations of functional HLA class I groups with HIV viral load in a heterogeneous cohort. 2022.06.21.22276431 Preprint at https://doi.org/10.1101/2022.06.21.22276431 (2022).
2. Zucco, A. G. et al. Personalized survival probabilities for SARS-CoV-2 positive patients by explainable machine learning. 2021.10.28.21265598 Preprint at https://doi.org/10.1101/2021.10.28.21265598 (2021).
3. Ekenberg, C. et al. Association Between Single-Nucleotide Polymorphisms in HLA Alleles and Human Immunodeficiency Virus Type 1 Viral Load in Demographically Diverse, Antiretroviral Therapy–Naive Participants From the Strategic Timing of AntiRetroviral Treatment Trial. J Infect Dis 220, 1325–1334 (2019).
4. Ekenberg, C. et al. The association of human leukocyte antigen alleles with clinical disease progression in HIV-positive cohorts with varied treatment strategies. AIDS 35, 783–789 (2021).
5. Gabrielaite, M. et al. Human immunotypes impose selection on viral genotypes through viral epitope specificity. The Journal of Infectious Diseases (2021) doi:10.1093/infdis/jiab253.
6. Moestrup, K. S. et al. Readmissions, post-discharge mortality and sustained recovery among patients admitted to hospital with COVID-19.
7. McGrath, S. & Ghersi, D. Building towards precision medicine: empowering medical professionals for the next revolution. BMC Medical Genomics 9, 23 (2016).
8. König, I. R., Fuchs, O., Hansen, G., Mutius, E. von & Kopp, M. V. What is precision medicine? European Respiratory Journal 50, (2017).
9. Gombar, S., Callahan, A., Califf, R., Harrington, R. & Shah, N. H. It is time to learn from patients like mine. npj Digit. Med. 2, 13 (2019).
10. Hasin, Y., Seldin, M. & Lusis, A. Multi-omics approaches to disease. Genome Biology 18, 83 (2017).
11. Njølstad, P. R. et al. Roadmap for a precision-medicine initiative in the Nordic region. Nature Genetics 1 (2019) doi:10.1038/s41588-019-0391-1.
12. Time to reality check the promises of machine learning-powered precision medicine. The Lancet Digital Health 2, e677–e680 (2020).
13. Miles, A. Complexity in medicine and healthcare: people and systems, theory and practice. Journal of Evaluation in Clinical Practice 15, 409–410 (2009).
14. Shmueli, G. To Explain or to Predict? Statist. Sci. 25, 289–310 (2010).
15. James, G., Witten, D., Hastie, T. & Tibshirani, R. An introduction to statistical learning. vol. 112 (Springer, 2013).
16. Breiman, L. Statistical Modeling: The Two Cultures (with comments and a rejoinder by the author). Statist. Sci. 16, 199–231 (2001).
17. Clarke, G. M. et al. Basic statistical analysis in genetic case-control studies. Nature Protocols 6, 121–133 (2011).
18. Baker, M. Statisticians issue warning over misuse of P values. Nature News 531, 151 (2016).
19. Goeman, J. J. & Solari, A. Tutorial in biostatistics: multiple hypothesis testing in genomics. in (2012).
20. Turing, A. M. On Computable Numbers, with an Application to the Entscheidungsproblem. Proceedings of the London Mathematical Society s2-42, 230–265 (1937).
21. Goodfellow, I., Bengio, Y. & Courville, A. Deep Learning. (MIT Press, 2016).
22. Yu, K.-H., Beam, A. L. & Kohane, I. S. Artificial intelligence in healthcare. Nature Biomedical Engineering 2, 719 (2018).
23. LeCun, Y., Bengio, Y. & Hinton, G. Deep learning. Nature 521, 436–444 (2015).
24. Rumelhart, D. E., Hinton, G. E. & Williams, R. J. Learning representations by back-propagating errors. Nature 323, 533–536 (1986).
25. Hastie, T., Tibshirani, R. & Friedman, J. H. The elements of statistical learning: data mining, inference, and prediction. vol. 2 (Springer, 2009).
26. Wang, P., Li, Y. & Reddy, C. K. Machine Learning for Survival Analysis: A Survey. arXiv:1708.04649 [cs, stat] (2017).
27. Roscher, R., Bohn, B., Duarte, M. F. & Garcke, J. Explainable Machine Learning for Scientific Insights and Discoveries. IEEE Access 8, 42200–42216 (2020).
28. Kaminski, M. E. & Malgieri, G. Algorithmic impact assessments under the GDPR: producing multi-layered explanations. International Data Privacy Law 11, 125–144 (2021).
29. Miller, T. Explanation in artificial intelligence: Insights from the social sciences. Artificial Intelligence 267, 1–38 (2019).
30. Lipton, Z. C. The Mythos of Model Interpretability. arXiv:1606.03490 [cs, stat] (2016).
31. Belle, V. & Papantonis, I. Principles and Practice of Explainable Machine Learning. arXiv:2009.11698 [cs, stat] (2020).
32. Udrescu, S.-M. & Tegmark, M. AI Feynman: A physics-inspired method for symbolic regression. Science Advances 6, eaay2631 (2020).
33. Montavon, G., Samek, W. & Müller, K.-R. Methods for Interpreting and Understanding Deep Neural Networks. arXiv:1706.07979 [cs, stat] (2017).
34. Ribeiro, M. T., Singh, S. & Guestrin, C. ‘Why Should I Trust You?’: Explaining the Predictions of Any Classifier. arXiv:1602.04938 [cs, stat] (2016).
35. Lundberg, S. & Lee, S.-I. A Unified Approach to Interpreting Model Predictions. arXiv:1705.07874 [cs, stat] (2017).
36. Wang, F., Kaushal, R. & Khullar, D. Should Health Care Demand Interpretable Artificial Intelligence or Accept “Black Box” Medicine? Ann Intern Med 172, 59–60 (2020).
37. Topol, E. J. High-performance medicine: the convergence of human and artificial intelligence. Nature Medicine 25, 44 (2019).
38. Cadario, R., Longoni, C. & Morewedge, C. K. Understanding, explaining, and utilizing medical artificial intelligence. Nat Hum Behav 5, 1636–1642 (2021).
39. Caruana, R. et al. Intelligible Models for HealthCare: Predicting Pneumonia Risk and Hospital 30-day Readmission. in Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining 1721–1730 (ACM, 2015). doi:10.1145/2783258.2788613.
40. Li, Y., Sperrin, M., Ashcroft, D. M. & Staa, T. P. van. Consistency of variety of machine learning and statistical models in predicting clinical risks of individual patients: longitudinal cohort study using cardiovascular disease as exemplar. BMJ 371, (2020).
41. Lakhani, P. & Sundaram, B. Deep Learning at Chest Radiography: Automated Classification of Pulmonary Tuberculosis by Using Convolutional Neural Networks. Radiology 284, 574–582 (2017).
42. Wang, M., Wei, Z., Jia, M., Chen, L. & Ji, H. Deep learning model for multi-classification of infectious diseases from unstructured electronic medical records. BMC Medical Informatics and Decision Making 22, 41 (2022).
43. Goh, K. H. et al. Artificial intelligence in sepsis early prediction and diagnosis using unstructured data in healthcare. Nat Commun 12, 711 (2021).
44. Agius, R. et al. Machine learning can identify newly diagnosed patients with CLL at high risk of infection. Nature Communications 11, 117 (2020).
45. Ardito, L., Coccia, M. & Messeni Petruzzelli, A. Technological exaptation and crisis management: Evidence from COVID-19 outbreaks. R&D Management (2021) doi:10.1111/radm.12455.
46. World Health Organization. Weekly epidemiological update on COVID-19 - 8 June 2022.
47. Matheson, N. J. & Lehner, P. J. How does SARS-CoV-2 cause COVID-19? Science 369, 510–511 (2020).
48. Sah, P. et al. Asymptomatic SARS-CoV-2 infection: A systematic review and meta-analysis. Proceedings of the National Academy of Sciences 118, e2109229118 (2021).
49. Lechien, J. R. et al. Clinical and epidemiological characteristics of 1420 European patients with mild-to-moderate coronavirus disease 2019. Journal of Internal Medicine 288, 335–344 (2020).
50. Clinical characteristics of COVID-19. European Centre for Disease Prevention and Control https://www.ecdc.europa.eu/en/covid-19/latest-evidence/clinical.
51. Reilev, M. et al. Characteristics and predictors of hospitalization and death in the first 11 122 cases with a positive RT-PCR test for SARS-CoV-2 in Denmark: a nationwide cohort. International Journal of Epidemiology 49, 1468–1481 (2020).
52. Syrowatka, A. et al. Leveraging artificial intelligence for pandemic preparedness and response: a scoping review to identify key use cases. npj Digit. Med. 4, 114 (2021).
53. Roberts, M. et al. Common pitfalls and recommendations for using machine learning to detect and prognosticate for COVID-19 using chest radiographs and CT scans. Nat Mach Intell 3, 199–217 (2021).
54. Izcovich, A. et al. Prognostic factors for severity and mortality in patients infected with COVID-19: A systematic review. PLOS ONE 15, e0241955 (2020).
55. Wynants, L. et al. Prediction models for diagnosis and prognosis of covid-19 infection: systematic review and critical appraisal. BMJ 369, (2020).
56. Navarro, C. L. A. et al. Risk of bias in studies on prediction models developed using supervised machine learning techniques: systematic review. BMJ 375, n2281 (2021).
57. Vock, D. M. et al. Adapting machine learning techniques to censored time-to-event health record data: A general-purpose approach using inverse probability of censoring weighting. Journal of Biomedical Informatics 61, 119–131 (2016).
58. Milo Rasouly, H., Aggarwal, V., Bier, L., Goldstein, D. B. & Gharavi, A. G. Cases in Precision Medicine: Genetic Testing to Predict Future Risk for Disease in a Healthy Patient. Ann Intern Med 174, 540–547 (2021).
59. Collins, F. S. & Varmus, H. A New Initiative on Precision Medicine. New England Journal of Medicine 372, 793–795 (2015).
60. The Lancet. 20 years of precision medicine in oncology. The Lancet 397, 1781 (2021).
61. Caliendo, A. M. et al. Better Tests, Better Care: Improved Diagnostics for Infectious Diseases. Clin Infect Dis 57, S139–S170 (2013).
62. Chavda, V. P., Patel, A. B. & Vaghasiya, D. D. SARS-CoV-2 variants and vulnerability at the global level. Journal of Medical Virology 94, 2986–3005 (2022).
63. Moser, C. et al. Antibiotic therapy as personalized medicine – general considerations and complicating factors. APMIS 127, 361–371 (2019).
64. Rello, J. et al. Towards precision medicine in sepsis: a position paper from the European Society of Clinical Microbiology and Infectious Diseases. Clinical Microbiology and Infection 24, 1264–1272 (2018).
65. Ladner, J. T., Grubaugh, N. D., Pybus, O. G. & Andersen, K. G. Precision epidemiology for infectious disease control. Nat Med 25, 206–211 (2019).
66. Thorball, C. W., Fellay, J. & Borghesi, A. Immunological lessons from genome-wide association studies of infections. Current Opinion in Immunology 72, 87–93 (2021).
67. Tian, C. et al. Genome-wide association and HLA region fine-mapping studies identify susceptibility loci for multiple common infections. Nature Communications 8, 599 (2017).
68. Barré-Sinoussi, F. et al. Isolation of a T-Lymphotropic Retrovirus from a Patient at Risk for Acquired Immune Deficiency Syndrome (AIDS). Science 220, 868–871 (1983).
69. HIV/AIDS. https://www.who.int/health-topics/hiv-aids.
70. Deeks, S. G., Overbaugh, J., Phillips, A. & Buchbinder, S. HIV infection. Nat Rev Dis Primers 1, 122 (2015).
71. INSIGHT START Study Group. Initiation of Antiretroviral Therapy in Early Asymptomatic HIV Infection. New England Journal of Medicine 373, 795–807 (2015).
72. Rodger, A. J. et al. Risk of HIV transmission through condomless sex in serodifferent gay couples with the HIV-positive partner taking suppressive antiretroviral therapy (PARTNER): final results of a multicentre, prospective, observational study. The Lancet 393, 2428–2438 (2019).
73. Naranbhai, V. & Carrington, M. Host genetic variation and HIV disease: from mapping to mechanism. Immunogenetics 69, 489–498 (2017).
74. Bartha, I. et al. Estimating the Respective Contributions of Human and Viral Genetic Variation to HIV Control. PLOS Computational Biology 13, e1005339 (2017).
75. Fraser, C. et al. Virulence and Pathogenesis of HIV-1 Infection: An Evolutionary Perspective. Science 343, (2014).
76. Sirugo, G., Williams, S. M. & Tishkoff, S. A. The Missing Diversity in Human Genetic Studies. Cell 177, 26–31 (2019).
77. Dendrou, C. A., Petersen, J., Rossjohn, J. & Fugger, L. HLA variation and disease. Nat Rev Immunol 18, 325–339 (2018).
78. Luo, Y. et al. A high-resolution HLA reference panel capturing global population diversity enables multi-ancestry fine-mapping in HIV host response. Nat Genet 53, 1504–1516 (2021).
79. Moshiri, N. ViralMSA: massively scalable reference-guided multiple sequence alignment of viral genomes. Bioinformatics 37, 714–716 (2021).
80. Mapleson, D., Garcia Accinelli, G., Kettleborough, G., Wright, J. & Clavijo, B. J. KAT: a K-mer analysis toolkit to quality control NGS datasets and genome assemblies. Bioinformatics 33, 574–576 (2017).
81. Jurtz, V. et al. NetMHCpan-4.0: Improved Peptide–MHC Class I Interaction Predictions Integrating Eluted Ligand and Peptide Binding Affinity Data. The Journal of Immunology 199, 3360–3368 (2017).
82. Pappas, D. J. et al. Significant variation between SNP-based HLA imputations in diverse populations: the last mile is the hardest. The Pharmacogenomics Journal 18, 367–376 (2018).
83. Zheng, X. et al. HIBAG–HLA genotype imputation with attribute bagging. The Pharmacogenomics Journal 14, 192–200 (2014).
84. Reynisson, B., Alvarez, B., Paul, S., Peters, B. & Nielsen, M. NetMHCpan-4.1 and NetMHCIIpan-4.0: improved predictions of MHC antigen presentation by concurrent motif deconvolution and integration of MS MHC eluted ligand data. Nucleic Acids Research 48, W449–W454 (2020).
85. Thomsen, M., Lundegaard, C., Buus, S., Lund, O. & Nielsen, M. MHCcluster, a method for functional clustering of MHC molecules. Immunogenetics 65, 655–665 (2013).
86. Cox, D. R. Regression Models and Life-Tables. Journal of the Royal Statistical Society. Series B (Methodological) 34, 187–220 (1972).
87. Tutz, G. & Schmid, M. Modeling Discrete Time-to-Event Data. (Springer International Publishing, 2016). doi:10.1007/978-3-319-28158-2.
88. van Dongen, S. & Enright, A. J. Metric distances derived from cosine similarity and Pearson and Spearman correlations. arXiv:1208.3145 [cs, stat] (2012).
89. Ronan, T., Qi, Z. & Naegle, K. M. Avoiding common pitfalls when clustering biological data. Sci. Signal. 9, re6 (2016).
90. Strehl, A. & Ghosh, J. Cluster Ensembles – a Knowledge Reuse Framework for Combining Multiple Partitions. J. Mach. Learn. Res. 3, 583–617 (2003).
91. Ke, G. et al. LightGBM: A Highly Efficient Gradient Boosting Decision Tree. Advances in Neural Information Processing Systems 30, (2017).
92. Chicco, D. & Jurman, G. The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation. BMC Genomics 21, 6 (2020).
93. Uno, H., Cai, T., Pencina, M. J., D’Agostino, R. B. & Wei, L. J. On the C-statistics for evaluating overall adequacy of risk prediction procedures with censored survival data. Stat Med 30, 1105–1117 (2011).
94. Covert, I., Lundberg, S. & Lee, S.-I. Explaining by Removing: A Unified Framework for Model Explanation. Journal of Machine Learning Research 22, 1–90 (2021).
95. Lundberg, S. M. et al. From local explanations to global understanding with explainable AI for trees. Nat Mach Intell 2, 56–67 (2020).
96. R Core Team. R: A Language and Environment for Statistical Computing. (R Foundation for Statistical Computing, 2019).
97. Wickham, H. et al. Welcome to the tidyverse. Journal of Open Source Software 4, 1686 (2019).
98. The pandas development team. pandas-dev/pandas: Pandas 1.3.3. (2021) doi:10.5281/zenodo.5501881.
99. Harris, C. R. et al. Array programming with NumPy. Nature 585, 357–362 (2020).
100. Pedregosa, F. et al. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research 12, 2825–2830 (2011).
101. Pölsterl, S. scikit-survival: A Library for Time-to-Event Analysis Built on Top of scikit-learn. Journal of Machine Learning Research 21, 1–6 (2020).
102. Pollard, T. J., Johnson, A. E. W., Raffa, J. D. & Mark, R. G. tableone: An open source Python package for producing summary statistics for research papers. JAMIA Open 1, 26–31 (2018).
103. Letunic, I. & Bork, P. Interactive Tree Of Life (iTOL) v4: recent updates and new developments. Nucleic Acids Res 47, W256–W259 (2019).
104. Galili, T. dendextend: an R package for visualizing, adjusting and comparing trees of hierarchical clustering. Bioinformatics 31, 3718–3720 (2015).
105. Chang, W. et al. shiny: Web Application Framework for R. (2021).
106. Bartha, I. et al. A genome-to-genome analysis of associations between human genetic variation, HIV-1 sequence diversity, and viral control. eLife Sciences 2, e01123 (2013).
107. Arora, J. et al. HIV peptidome-wide association study reveals patient-specific epitope repertoires associated with HIV control. PNAS 116, 944–949 (2019).
108. Debebe, B. J. et al. Identifying the immune interactions underlying HLA class I disease associations. eLife 9, e54558 (2020).
109. Fellay, J. & Pedergnana, V. Exploring the interactions between the human and viral genomes. Hum Genet (2019) doi:10.1007/s00439-019-02089-3.
110. Dorp, C. van & Kesmir, C. Estimating HLA disease associations using similarity trees. bioRxiv 408302 (2018) doi:10.1101/408302.
111. Kennedy, A. E., Ozbek, U. & Dorak, M. T. What has GWAS done for HLA and disease associations? International Journal of Immunogenetics 44, 195–211 (2017).
112. Goulder, P. J. R. & Walker, B. D. HIV and HLA Class I: An Evolving Relationship. Immunity 37, 426–440 (2012).
113. Valenzuela-Ponce, H. et al. Novel HLA class I associations with HIV-1 control in a unique genetically admixed population. Scientific Reports 8, 6111 (2018).
114. Zhang, X. et al. HLA-B*44 Is Associated with a Lower Viral Set Point and Slow CD4 Decline in a Cohort of Chinese Homosexual Men Acutely Infected with HIV-1. Clin Vaccine Immunol 20, 1048–1054 (2013).
115. Kvamme, H. & Borgan, Ø. Continuous and Discrete-Time Survival Prediction with Neural Networks. arXiv:1910.06724 [cs, stat] (2019).
116. Sloma, M., Syed, F., Nemati, M. & Xu, K. S. Empirical Comparison of Continuous and Discrete-time Representations for Survival Prediction. in Proceedings of AAAI Spring Symposium on Survival Prediction - Algorithms, Challenges, and Applications 2021 118–131 (PMLR, 2021).
117. Haider, H., Hoehn, B., Davis, S. & Greiner, R. Effective Ways to Build and Evaluate Individual Survival Distributions. Journal of Machine Learning Research 21, 1–63 (2020).
118. Roscher, R., Bohn, B., Duarte, M. F. & Garcke, J. Explainable Machine Learning for Scientific Insights and Discoveries. arXiv:1905.08883 [cs, stat] (2019).
119. Lauritsen, S. M. et al. Explainable artificial intelligence model to predict acute critical illness from electronic health records. Nature Communications 11, 3852 (2020).
120. Yan, L. et al. An interpretable mortality prediction model for COVID-19 patients. Nature Machine Intelligence 16 (2020) doi:10.1038/s42256-020-0180-7.
121. Murdoch, W. J., Singh, C., Kumbier, K., Abbasi-Asl, R. & Yu, B. Interpretable machine learning: definitions, methods, and applications. arXiv:1901.04592 [cs, stat] (2019).
122. Molnar, C. et al. General Pitfalls of Model-Agnostic Interpretation Methods for Machine Learning Models. arXiv:2007.04131 [cs, stat] (2021).
123. Jensen, P. B., Jensen, L. J. & Brunak, S. Mining electronic health records: towards better research applications and clinical care. Nature Reviews Genetics 13, 395–405 (2012).
124. Statens Serum Institut. Typical misinformation regarding Danish COVID-numbers. https://en.ssi.dk/covid-19/typical-misinformation-regarding-danish-covid-numbers.
125. Landi, I. et al. Deep representation learning of electronic health records to unlock patient stratification at scale. npj Digit. Med. 3, 111 (2020).
126. Li, Y. et al. BEHRT: Transformer for Electronic Health Records. Sci Rep 10, 7155 (2020).
127. Gao, Y. dong et al. Risk factors for severe and critically ill COVID-19 patients: A review. Allergy: European Journal of Allergy and Clinical Immunology 76, 428–455 (2021).
128. Zhang, J. et al. Risk factors for disease severity, unimprovement, and mortality in COVID-19 patients in Wuhan, China. Clin Microbiol Infect 26, 767–772 (2020).
129. Gao, F. et al. Obesity Is a Risk Factor for Greater COVID-19 Severity. Diabetes Care (2020) doi:10.2337/dc20-0682.
130. Guan, W. et al. Comorbidity and its impact on 1590 patients with COVID-19 in China: a nationwide analysis. European Respiratory Journal 55, (2020).
131. Cippà, P. E. et al. A data-driven approach to identify risk profiles and protective drugs in COVID-19. PNAS 118, (2021).
132. Benfield, T. et al. Improved Survival Among Hospitalized Patients With Coronavirus Disease 2019 (COVID-19) Treated With Remdesivir and Dexamethasone. A Nationwide Population-Based Cohort Study. Clinical Infectious Diseases (2021) doi:10.1093/cid/ciab536.
133. Tough, R. H. & McLaren, P. J. Interaction of the Host and Viral Genome and Their Influence on HIV Disease. Front. Genet. 9, (2019).
134. Zhang, A., Xing, L., Zou, J. & Wu, J. C. Shifting machine learning for healthcare from development to deployment and from models to data. Nat. Biomed. Eng 116 (2022) doi:10.1038/s41551-022-00898-y.
135. Burgner, D., Jamieson, S. E. & Blackwell, J. M. Genetic susceptibility to infectious diseases: big is beautiful, but will bigger be even better? Lancet Infect Dis 6, 653–663 (2006).
136. Centre of excellence for personalised medicine of infectious complications in immune deficiency. https://www.persimune.dk/.
137. Videreuddannelse, S. Master i personlig medicin. https://personligmedicin.ku.dk/ (2020).
138. Gunčar, G. et al. An application of machine learning to haematological diagnosis. Scientific Reports 8, 411 (2018).
139. Sanderson, E. et al. Mendelian randomization. Nat Rev Methods Primers 2, 121 (2022).
140. Richens, J. G., Lee, C. M. & Johri, S. Improving the accuracy of medical diagnosis with causal machine learning. Nature Communications 11, 3923 (2020).
141. Muñoz, M. et al. Evaluating the contribution of genetics and familial shared environment to common disease using the UK Biobank. Nat Genet 48, 980–983 (2016).
12. Manuscripts
Manuscript of Study I
Associations of functional HLA class I groups with HIV viral load in a
heterogeneous cohort.1
Adrian G. Zucco, Marc Bennedbæk, Christina Ekenberg, Migle Gabrielaite, Preston Leung,
Mark N. Polizzotto, Virginia Kan, Daniel D. Murray, Jens D. Lundgren and Cameron R.
MacPherson for the INSIGHT START study group.
medRxiv, June 2022.
(Submitted to AIDS)
Associations of functional HLA class I groups with HIV viral load in a
heterogeneous cohort.
Adrian G. ZUCCO1, Marc BENNEDBÆK2, Christina EKENBERG1, Migle GABRIELAITE3, Preston LEUNG1, Mark N.
POLIZZOTTO4, Virginia KAN5, Daniel D. MURRAY1, Jens D. LUNDGREN1 and Cameron R. MACPHERSON1 for
the INSIGHT START study group.
1PERSIMUNE Center of Excellence, Rigshospitalet, Copenhagen, Denmark
2Virus Research and Development Laboratory, Virus and Microbiological Special Diagnostics, Statens Serum
Institut, Copenhagen, Denmark
3Center for Genomic Medicine, Copenhagen University Hospital, Copenhagen, Denmark.
4Clinical Hub for Interventional Research, College of Health and Medicine, The Australian National University,
Canberra, Australia
5George Washington University, Veterans Affairs Medical Center, Washington D.C, U.S.A.
Corresponding author
Adrian Gabriel Zucco, MSc, PhD student
Tel: +45 35 45 57 75
mail: adrian.gabriel.zucco@regionh.dk
Rigshospitalet, Copenhagen University Hospital
Centre of Excellence for Health, Immunity and Infections (CHIP) & PERSIMUNE
Blegdamsvej 9, DK-2100 Copenhagen Ø, Denmark
Abstract
Human Leucocyte Antigen (HLA) class I alleles are the main host genetic factors involved in controlling HIV-1
viral load (VL). Nevertheless, HLA diversity has proven a significant challenge in association studies. We
assessed how accounting for binding affinities of HLA class I alleles to HIV-1 peptides facilitates association
testing of HLA with HIV-1 VL in a heterogeneous cohort from the Strategic Timing of AntiRetroviral Treatment
(START) study. We imputed HLA class I alleles from host genetic data (2,546 HIV+ participants) and sampled
immunopeptidomes from 2,079 host-paired viral genomes (targeted amplicon sequencing). We predicted
HLA class I binding affinities to HIV-1 and unspecific peptides, grouping alleles into functional clusters through
consensus clustering. These functional HLA class I clusters were used to test associations with HIV VL. We
identified four clades totalling 30 HLA alleles accounting for 11.4% variability in VL. We highlight HLA-B*57:01
and B*57:03 as functionally similar yet overrepresented in distinct ethnic groups; when combined, they
show a protective association with HIV+ VL (log, β -0.25; adj. p-value < 0.05). We further demonstrate
only a slight power reduction when using unspecific immunopeptidomes, facilitating the use of the inferred
functional HLA groups in other studies. The outlined computational approach provides a robust and efficient
way to incorporate HLA function and peptide diversity, aiding clinical association studies in heterogeneous
cohorts. To facilitate access to the proposed methods and results we provide an interactive application for
exploring data.
Introduction
The Human Leucocyte Antigen (HLA) is a critical component of the host immune response. HLA Class I alleles
mediate the anti-viral response through the presentation of intracellular viral peptides for recognition by
Cytotoxic T cells. This mechanism is critical to the host’s defence against diverse pathogens, which is why it
is among the most genetically diverse regions in the human genome as evidenced by its association with our
variable response to infectious disease [1]. In the context of HIV, HLA alleles are the canonical host genetic
factors associated with viral load (VL)[2], altogether contributing up to 12% of its variability [3]. Viral diversity,
on the other hand, is thought to explain two to four times as much (20%-46%) [4]. The combined relevance
of both host and viral genomics suggests the necessity for the simultaneous analysis of both in the context
of infectious disease[5]. Different approaches have been used to study this interaction such as genome-to-
genome [6] or peptidome-wide associations [7]. In addition, when considering the interplay between host
HLA alleles and viral peptides, their association could be explained functionally in terms of epitope binding
and presentation.
Analysis of genetic variance within the HLA region and its associations with clinical outcomes is challenged
by population-dependent distributions and the diversity of HLA haplotypes. Reaching statistical power within
this region is difficult and while accounting for the effects of this important immunological region is
necessary, it is often ignored altogether. Yet, the function of HLA is conserved as a core component of the
immune response [8], [9]. This suggests there may be shared functional features within and between
populations, even in the presence of genotypic plasticity. Understanding the structure of this mutual
information could prove relevant for the study of host-pathogen interactions where high mutation rates are
observed such as in HIV [10].
Immunopeptidomes have been used to estimate functional similarities between HLA alleles in what was
initially denominated HLA supertypes [11]. These functional groups are based on the propensity of various
alleles to bind similar sets of peptides. The algorithms used to estimate these binding profiles have gradually
improved over time [12], [13]. However, the consideration of HLA functional groups in the context of
Genome-Wide Association Studies (GWAS) has largely been overlooked [14].
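To make the notion of functional similarity concrete, the sketch below scores pairs of alleles by the overlap of the peptide sets they are predicted to bind. The Jaccard index is used here purely as an illustrative assumption; it is not the similarity measure or clustering procedure of this study, and the allele and peptide labels are placeholders rather than real predictions.

```python
def jaccard(a, b):
    """Jaccard similarity between two sets of peptides."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical predicted binders per allele (placeholder labels only).
binders = {
    "allele_1": {"pep1", "pep2", "pep3"},
    "allele_2": {"pep2", "pep3", "pep4"},
    "allele_3": {"pep7", "pep8"},
}

# Pairwise similarity for every unordered pair of alleles.
similarity = {
    (x, y): jaccard(binders[x], binders[y])
    for x in binders
    for y in binders
    if x < y
}
```

Alleles that share most of their predicted binders score close to 1 and would be candidates for the same functional group, whereas alleles with disjoint binder sets score 0.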
In this study, we considered the ability of HLA proteins to functionally bind peptides leveraging the cost-
effectiveness of high-throughput genotyping as opposed to directly assaying HLA function. We provide a
framework based on state-of-the-art computational methods for the study of predicted immunopeptidomes
and functional HLA groups in the context of HIV-1 infection. In doing so, we demonstrate the increased
statistical power that is gained by moving away from a purely genetic approach to a functional one with
implications for studying the immune response either directly or as a confounder. We processed HIV
immunopeptidomes that incorporate intra-host viral diversity and imputed HLA alleles from the same
geographically diverse cohort of antiretroviral therapy (ART)-naïve people living with HIV (PLWH) who
participated in the Strategic Timing of Anti-Retroviral Treatment (START) trial [16]. Interactive results of
this work and complementary data are available via a web application
(https://persimune-health-informatics.shinyapps.io/PAW2022Zucco__HLA_HIV_INSIGHT/).
Methods
Ethics
Host and viral samples in this study were extracted and analyzed from participants in the START clinical trial
(NCT00867048) [16], conducted by the International Network for Strategic Initiatives in Global HIV Trials
(INSIGHT) and the Community Programs for Clinical Research on AIDS (CPCRA). Written consent for the study
and genetic analyses was obtained from the participants, and the study was approved by the participants' site
ethics review committees.
HLA class I alleles
Imputation of HLA class I alleles (HLA-A, HLA-B and HLA-C) for 2546 genotyped ART-naïve, HIV+ participants
was performed with HIBAG at 4-digit resolution [17]. Full details of the imputation process and quality control
are described in previous publications [2], [18]. A multi-ethnic pre-trained model was used for imputation
with a minimum out-of-bag accuracy of over 90% for all loci. HLA diversity was measured by the inverse
Simpson index based on the HLA allele frequencies per locus and country. This index represents the
complement of the probability that two participants would have the same HLA allele for a selected locus in
a country.
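As a minimal sketch of the diversity calculation described above, the snippet below computes Simpson-type indices from a list of allele calls. The text names the inverse Simpson index but describes it as the complement of the probability of drawing the same allele twice, so both forms are returned here; which one was used in the study is not stated beyond this. The function name and the example allele calls are hypothetical.

```python
from collections import Counter

def simpson_indices(alleles):
    """Return (complement form, reciprocal form) of the Simpson index."""
    counts = Counter(alleles)
    total = sum(counts.values())
    # Probability that two draws (with replacement) carry the same allele
    same = sum((c / total) ** 2 for c in counts.values())
    return 1.0 - same, 1.0 / same

# Hypothetical HLA-A calls for one country (illustrative only, not study data)
calls = ["A*02:01", "A*02:01", "A*01:01", "A*03:01"]
complement_form, reciprocal_form = simpson_indices(calls)
```

Applied per locus and per country, either value could be used as the diversity score that is later compared with the mean viral load.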
Immunopeptidomes and binding affinity prediction
Plasma samples were obtained for a subset of 2079 ART-naïve, HIV+ participants from 21 countries enrolled
in the START study. Viral RNA was sequenced, paired-end, using Illumina MiSeq and covered two amplicons
in the HXB2 genome positioned 1485-5058 and 5967-9517. The sample preparation, library preparation,
sequencing procedure, and detailed quality controls have been described previously [19]. Raw reads were
fragmented into 27-mers using KAT [20] and those with a count higher than 1 were translated into peptide
sequences of 9 amino acids length to fit the mean length of HLA Class I epitopes. Peptides were mapped to
10 major HIV proteins (Asp, Gag-Pol, Nef, Vpr, Vpu, gp160, Vif, Pr55, Rev, Tat) from NCBI RefSeq NC001802.1
(Supplementary file 1) using BLAST 2.8.1 (blastp-short). Hits with an E-value > 1E-05 were excluded to remove
low-quality k-mers by considering exact matches. To compare HIV peptidomes with random sequences, a set
of half a million 9-mers was generated by processing the same number of random protein sequences from
UniProt. Binding affinities of 268 class I HLA alleles to both HIV and random peptidomes were predicted using
NetMHCpan 4.0 [15]. Three different immunopeptidome subsets were generated by selecting (i) peptides from the top
10% of binding affinities, (ii) the 10% most variable peptides, and (iii) peptides that potentially bind to at least 10%
of the alleles using <500 nM as a general threshold [12].
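The three subsets can be sketched as follows for a peptide-by-allele matrix of predicted affinities in nM. This is an illustration under stated assumptions, not the study code: the text does not specify how the "top 10%" and "most variable" peptides were ranked, so ranking by the strongest affinity across alleles and by the variance of log-affinity are assumptions, as are the function and variable names.

```python
import numpy as np

def immunopeptidome_subsets(affinity_nm, frac=0.10, binder_nm=500.0):
    """affinity_nm: (n_peptides, n_alleles) predicted affinities in nM (lower = stronger)."""
    n_peptides = affinity_nm.shape[0]
    k = max(1, int(round(frac * n_peptides)))
    # (i) strongest binders: ranked by best (minimum) affinity across alleles (assumption)
    best = affinity_nm.min(axis=1)
    top_binding = np.sort(np.argsort(best)[:k])
    # (ii) most variable: ranked by variance of log-affinity across alleles (assumption)
    variability = np.log(affinity_nm).var(axis=1)
    most_variable = np.sort(np.argsort(variability)[::-1][:k])
    # (iii) peptides predicted to bind (< binder_nm) at least `frac` of the alleles
    fraction_bound = (affinity_nm < binder_nm).mean(axis=1)
    promiscuous = np.flatnonzero(fraction_bound >= frac)
    return top_binding, most_variable, promiscuous

# Toy matrix: 10 peptides x 10 alleles, all weak binders except peptide 3 for allele 0
toy = np.full((10, 10), 10000.0)
toy[3, 0] = 50.0
top_binding, most_variable, promiscuous = immunopeptidome_subsets(toy)
```

Each returned index array selects the rows (peptides) of one subset, which can then be fed separately into the clustering step.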
Consensus clustering
Hierarchical clustering was implemented using two different linkage functions. Average linkage was used for
measuring relationships between HLA alleles represented by dissimilarity defined as cosine, correlation and
Euclidean distances. For clustering based on Ward linkage, cosine and correlation distances were corrected
by the square root to satisfy the triangular inequality necessary to operate in Euclidean space [21]. To
generate an ensemble of clustering solutions, we employed consensus clustering to mitigate bias from the
subset, distance metric, or chosen linkage function [22]. A consensus matrix (Cij) of size (n x n) was built where
each element is the number of times an ith allele clustered together with a jth allele [23] at a varying total
number of clusters selected (3 to 160). The consensus matrix was then processed by hierarchical clustering
with average linkage after transforming the values into dissimilarity scores (1 − Cij). Clustering was performed
in Python 3.7.1 using the Scipy library [24].
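The consensus procedure above can be sketched with SciPy as follows. This is a simplified illustration, not the study code: it uses a single peptide subset, normalises the co-clustering counts by the number of runs so that 1 − Cij lies in [0, 1], and also applies Ward linkage to the uncorrected Euclidean distances; all three choices are assumptions, and the function name and toy data are hypothetical.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist, squareform

def consensus_tree(profiles, cluster_counts=range(3, 161)):
    """profiles: (n_alleles, n_peptides) matrix of predicted binding affinities."""
    n = profiles.shape[0]
    trees = []
    for metric in ("cosine", "correlation", "euclidean"):
        d = pdist(profiles, metric=metric)
        trees.append(linkage(d, method="average"))
        # Square-root correction of cosine/correlation distances before Ward linkage
        d_ward = d if metric == "euclidean" else np.sqrt(np.clip(d, 0.0, None))
        trees.append(linkage(d_ward, method="ward"))
    co_clustered = np.zeros((n, n))
    runs = 0
    for Z in trees:
        for k in cluster_counts:
            labels = fcluster(Z, t=k, criterion="maxclust")
            # Count allele pairs that fall into the same cluster in this run
            co_clustered += labels[:, None] == labels[None, :]
            runs += 1
    consensus = co_clustered / runs          # fraction of runs (normalisation is an assumption)
    dissimilarity = 1.0 - consensus
    np.fill_diagonal(dissimilarity, 0.0)
    Z_consensus = linkage(squareform(dissimilarity, checks=False), method="average")
    return consensus, Z_consensus

# Toy profiles: alleles 0-2 share one binding pattern, alleles 3-5 another
toy = np.array([[10, 1, 1, 2], [11, 1, 2, 1], [10, 2, 1, 1],
                [1, 10, 2, 1], [2, 11, 1, 1], [1, 10, 1, 2]], dtype=float)
consensus, Z_consensus = consensus_tree(toy, cluster_counts=[2])
```

In the full analysis the same loop would additionally run over the three immunopeptidome subsets, so that no single subset, metric, or linkage dominates the final tree.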
Statistical analyses
Associations of log10-transformed VL with each node of the consensus tree were tested by linear regression
and adjusted by sex, self-reported race, and country. Tested HLA alleles had to be present in more than 10%
of the participants. Multiple testing was controlled by a Benjamini-Hochberg procedure using a q-value <
0.05 to identify associations. Analyses were performed in R v3.6.0 [25].
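For readers unfamiliar with the multiple-testing step, the Benjamini-Hochberg adjustment can be written in a few lines. The study analyses were run in R; the NumPy version below is only an illustrative re-implementation of the standard procedure, and the p-values are made up.

```python
import numpy as np

def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values (q-values) in the input order."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)
    # Scale each sorted p-value by m / rank
    scaled = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity, starting from the largest p-value
    scaled = np.minimum.accumulate(scaled[::-1])[::-1]
    q = np.empty(m)
    q[order] = np.clip(scaled, 0.0, 1.0)
    return q

# Hypothetical p-values from five node-level regressions
p_vals = [0.001, 0.008, 0.039, 0.041, 0.6]
q_vals = benjamini_hochberg(p_vals)
significant = q_vals < 0.05
```

Note that the third and fourth tests have raw p-values below 0.05 but are no longer retained after adjustment.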
Data visualization and availability
Consensus clustering dendrograms and association coefficients were depicted using Interactive Tree of Life
(iTOL) [26] and tanglegrams, which were generated by the dendextend R package [27]. We provide a flexible
visualization of peptide-to-HLA binding profiles across viral proteins. Data downloads and access to
supplementary materials are made available through the app
(https://persimune-health-informatics.shinyapps.io/PAW2022Zucco__HLA_HIV_INSIGHT/).
Results
Baseline characteristics for the START cohort
We used baseline data and the genotypes of 2,546 participants from the START trial. Next-generation
sequencing of HIV samples was retrieved for a subset of 2,079 participants. All participants included in the
trial were asymptomatic HIV+ and ART-naïve with two CD4+ cell counts >500/μL at least 14 days apart within
60 days of enrollment in the trial. Baseline characteristics for study participants can be found in Table 1.
Prediction of patient-derived HIV immunopeptidomes
Our approach for immunopeptidome generation initially yielded 9.88 × 10^7 peptide 9-mers. After filtering
and mapping with BLAST to ten major HIV proteins, a total of 173,792 9-mers were considered. This
accounted for a 99.22% coverage across the reference proteome. Among the final list, we observed 136 best-
defined CTL/CD8+ epitopes from Los Alamos (version 2019-11-20; Supplementary file 2).
Data exploration
To facilitate the exploration of the results, a web application was developed (Figure 1) providing tools to
navigate interactively the global and local diversity of imputed HLA alleles, HIV subtype-derived peptides,
and their corresponding binding profiles. Shannon and Simpson diversity indices can be directly visualized on
the world map highlighting the geographical diversity of the cohort. This interactive tool facilitates the further
examination of specific HLA allele, HIV subtype, or peptide frequencies for the entire cohort and options to
explore details at a country level.
Higher HLA-A diversity is associated with a lower mean viral load per country
The diversity of HLA class I alleles measured in terms of the inverse Simpson index was calculated using the
HLA allele frequencies per locus and country for comparison. A negative univariate correlation (R = -0.65, p =
0.0018) between HLA-A diversity and mean HIV log10(viral load) per country was found. This indicates that
countries in our cohort with a higher diversity of HLA-A alleles tended to show lower levels of HIV viral load in the
population, represented in terms of mean log10(VL). No significant correlation was observed for HLA-B or
HLA-C (Figure 2).
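A country-level correlation of this kind can be computed as in the sketch below. The numbers are invented for illustration and are not study data; only the use of a Pearson correlation follows the text (see the Figure 2 legend).

```python
import numpy as np
from scipy.stats import pearsonr

# Hypothetical per-country values (illustrative only, not study data)
hla_a_diversity = np.array([10.0, 12.0, 15.0, 18.0, 20.0, 25.0])
mean_log10_vl = np.array([4.6, 4.5, 4.5, 4.3, 4.2, 4.1])

# Pearson correlation coefficient and its two-sided p-value
r, p_value = pearsonr(hla_a_diversity, mean_log10_vl)
```

Because each country contributes a single point, such a correlation is sensitive to countries with few participants, which is worth keeping in mind when interpreting it.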
Consensus clustering of HIV-derived immunopeptidomes improves statistical power compared to
random immunopeptidomes
Consensus trees were generated based on a half-million random peptides from Uniprot and the predicted
HIV immunopeptidome under the same methodology. The correlation between both trees was 0.985
indicating high similarity. Minor differences were found in allele distances (Figure S1) with only a few major
structural changes such as the displacement of B*35:20 from a predominantly B*15 node (unspecific
immunopeptidome) to one dominated by B*35 alleles (HIV immunopeptidome). Another two examples are
HLA-B*15:13 and B*15:58 which moved from a node of predominantly HLA-C alleles (unspecific
immunopeptidome) to a node of HLA-B alleles (HIV immunopeptidome). These small differences, when
combined, were translated into an increase in statistical power (Figure S2). Overall, when assessed in a
multivariate model including all HLA class I alleles as covariates the combined percentage of explained
variance accounted for 11.44% of the VL after adjustment for sex, self-reported race, and country. Hence,
our clusters of HLA function accounted for HIV-VL variance to a similar degree as reported by Bartha et al.,
2017 [3] for genotypic evidence in a comparatively pure European cohort. This highlights that accounting for
HLA function is both sufficient and better to explain variability in VL in heterogeneous populations.
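The text does not state how the correlation between the two consensus trees was computed. One common choice is the cophenetic correlation, sketched below under that assumption; the function name and the random demonstration data are hypothetical.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import pdist

def tree_correlation(Z_a, Z_b):
    """Pearson correlation of cophenetic distances for two trees with identical leaf order."""
    return float(np.corrcoef(cophenet(Z_a), cophenet(Z_b))[0, 1])

# Two trees built from the same toy profiles, the second with small perturbations
rng = np.random.default_rng(0)
base = rng.normal(size=(12, 30))
Z_a = linkage(pdist(base), method="average")
Z_b = linkage(pdist(base + rng.normal(scale=0.05, size=base.shape)), method="average")
similarity = tree_correlation(Z_a, Z_b)
```

A value close to 1 means that pairs of alleles merge at similar heights in both trees, even if a few leaves change their neighbourhood.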
HLA class I functional nodes associations with HIV-VL
Associations between the HLA functional nodes and the measurements of HIV-VL taken at study entry were
tested using linear regression. Four nodes with a consistent effect size were associated with HIV-VL (Figure
3). These nodes were composed of 30 HLA class I alleles of which 11 were observed in participants with
measurable VL at study entry. Two functional nodes were associated with a lower VL, one group composed
of HLA-B*57:01, B*58:01, B*57:02, and B*57:03 (β -0.25, q-value 7.02E-06) and the second group composed
of a pair of HLA-C*08 alleles, HLA-C*08:04 and C*08:01 (β -0.29, q-value 0.042). In contrast, two nodes
showed an association with higher VL, one cluster composed of six HLA-B*44 alleles, B*44:05, B*44:08,
B*44:04, B*44:03, B*44:02, B*44:27 (β 0.15, q-value 0.003) and a mixed group composed of 16 alleles:
B*35:20, B*35:16, B*35:10, B*35:43, B*35:08, B*35:19, B*35:41, B*35:01, B*35:17, B*35:05, B*44:06,
B*56:03, B*53:01, B*15:08 and B*15:11 (β 0.13, q-value 0.048). From these alleles, only two (HLA-B*57:01
and B*57:03) were associated with HIV-VL when tested at the individual allele level (Figure 3).
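The β coefficients reported above are effect sizes from adjusted linear regressions. A minimal sketch of one such test is given below, assuming that carriage of any allele in a node is coded as a 0/1 predictor and that the adjustment variables (sex, self-reported race, country) have already been encoded as numeric columns. The study used R; the function name and the simulated data are hypothetical.

```python
import numpy as np
from scipy import stats

def node_effect(log10_vl, node_carrier, covariates):
    """OLS of log10(VL) on a 0/1 node indicator plus covariates; returns (beta, p-value)."""
    X = np.column_stack([np.ones_like(log10_vl), node_carrier, covariates])
    coef, *_ = np.linalg.lstsq(X, log10_vl, rcond=None)
    residuals = log10_vl - X @ coef
    dof = len(log10_vl) - X.shape[1]
    sigma2 = residuals @ residuals / dof
    # Standard error of the node coefficient (column 1 of the design matrix)
    se = np.sqrt(sigma2 * np.linalg.inv(X.T @ X)[1, 1])
    t_stat = coef[1] / se
    return coef[1], 2.0 * stats.t.sf(abs(t_stat), dof)

# Simulated cohort with a true effect of -0.25 log10 copies/ml (illustrative only)
rng = np.random.default_rng(1)
n = 2000
carrier = (rng.random(n) < 0.3).astype(float)
covars = rng.normal(size=(n, 2))
log10_vl = 4.5 - 0.25 * carrier + 0.1 * covars[:, 0] + rng.normal(scale=0.5, size=n)
beta, p_value = node_effect(log10_vl, carrier, covars)
```

The resulting p-values, one per node or allele, are the inputs to the multiple-testing correction described in the Methods.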
Discussion
In this study, we implemented a computational approach based on consensus clustering of HLA alleles using
predicted immunopeptidomes to explore associations of functional HLA groups in a geographically diverse
cohort of PLWH. We defined functional similarity as differences in epitope binding profiles between HLA
alleles and performed HLA imputation on patients enrolled in the START study. We combined the host genetic
information with viral genomics through the prediction of immunopeptidomes. Viral sequences derived from
a subset of participants were processed into peptides to predict binding affinities to 268 HLA class I alleles
which allowed us to generate distance matrices of HLA alleles. Consensus clustering allowed us to
agglomerate HLA alleles into nodes by their functionality. Four nodes were found to be associated with HIV-
VL, implicating alleles that could not be detected when performing independent allele-specific tests alone.
These four nodes accounted for a total of 30 HLA alleles of which 11 were observed in our cohort.
The effects differed among the four nodes associated with HIV-VL. One node containing four well-
characterized alleles HLA-B*57:01, B*58:01, B*57:02, and B*57:03 showed an association with lower HIV-VL
indicating a protective effect. This effect has been previously reported on an individual allele level [28]; we
confirmed and observed shared binding affinity profiles that now indicate their common mode of function.
While having a similar effect in HIV-VL, these four alleles are represented at different frequencies between
populations. For example, HLA-B*57:01 is carried predominantly by people of European descent, whereas B*57:03 is
predominant among those of African descent [2]. The similarity in function was captured by our analysis
despite the differences in allele frequencies. A second protective node contained two alleles, HLA-C*08:04
and C*08:01, which have been reported in an admixed population and detected after adjusting for multiple
factors such as CD4+ and CD8+ counts [29]. We showed that such association could not only be detected in
our study but that functional clustering provided a clear and simple method to do it with greater mechanistic
insight. We report a node of six HLA-B*44 alleles, in this case, associated with a higher VL. This finding
contradicts a previous study in a Chinese cohort [30] in which a few alleles of the cluster were found to have
a protective effect. Given the large effect of viral diversity on VL, this could be a result of regional adaptation
that could not be accounted for in our study due to the lack of participants from the same region. While this
suggests a weakness in our study due to the underrepresentation of some ethnicities, divergence in our
results from studies on homogeneous populations could be a useful tool to detect localized effects. However,
data from a cohort with sufficient Chinese representation is needed to confirm the findings.
Defining discrete nodes of HLA alleles is challenging as distances between alleles are based on predicted
binding affinities and measured continuously. Therefore, metrics to define a fixed number of clusters failed
to suggest a robust threshold for cutting the trees obtained by consensus clustering. For this reason, the final
HIV-associated groups of alleles were determined by assessing the trees hierarchically from leaves (individual
HLA alleles) to roots (broad HLA nodes), thus optimizing groups based on their association with VL when the
combination of alleles in all child nodes supported it. In this way, the methodology implemented provides an
advantage over traditional HLA association studies by adaptively increasing the signal of functional groups,
which allows associations to be uncovered that could not be found at the individual allele level due to the number of
statistical comparisons or sample size [14]. Grouping or clustering of variables to increase statistical power
and limit multiple testing is common practice in epidemiological studies. The functional clustering presented
here, however, is based on the assumption that the majority of HLA functionality is mediated by peptide
binding affinities. This serves to facilitate study design by simplifying decisions on the size and diversity of the
cohort while allowing a biological interpretation of the results. For example, it enables the inference of the
effect of unobserved HLA alleles on HIV-VL given their common function to those observed. These are notable
challenges across clinical studies attempting to measure the effect on the immune response. Altogether, the
functional clustering approach presented is neither specific to HIV nor viral load and may be transferred to
study HLAs and outcomes in other host-pathogen interactions.
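To make the node-level testing concrete, the sketch below walks a SciPy dendrogram and builds, for every leaf and internal node, a 0/1 vector marking participants who carry at least one allele under that node. This covers only the construction of the predictors; the leaves-to-root selection of the final groups described above is not reproduced, and the function name and toy data are hypothetical.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, to_tree

def node_carrier_matrix(Z, carriers):
    """carriers: (n_participants, n_alleles) 0/1 matrix with columns in tree leaf order.

    Returns {node_id: (leaf indices, 0/1 carrier vector)} for all leaves and internal nodes.
    """
    _, nodes = to_tree(Z, rd=True)
    out = {}
    for node in nodes:
        leaves = sorted(node.pre_order())  # ids of all leaves below this node
        indicator = carriers[:, leaves].any(axis=1).astype(int)
        out[node.get_id()] = (leaves, indicator)
    return out

# Toy tree over four alleles: {0, 1} and {2, 3} form two nodes below the root
points = np.array([[0.0], [0.1], [5.0], [5.2]])
Z = linkage(points, method="average")
carriers = np.array([[1, 0, 0, 0],   # participant carrying allele 0
                     [0, 0, 0, 1],   # participant carrying allele 3
                     [0, 0, 0, 0]])  # participant carrying none of the four
nodes = node_carrier_matrix(Z, carriers)
```

Because a node indicator pools the carriers of all its alleles, it has more carriers than any single leaf, which is the source of the gain in statistical power discussed above.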
Similar computational approaches were proposed to explore HLA class I molecules and their role in HIV-1
infection. These studies mainly focused on genome-to-genome approaches [6], techniques to find new
epitopes in European participants using a peptidome approach [7] or to explore the interactions of HLA class
I molecules to other ligands [31]. In contrast, we focused on the host factors, the HLA class I alleles,
accounting for the high diversity of the cohort and inclusion of a larger number of viral sequences to predict
immunopeptidomes and expand on genome-to-genome analyses previously performed in our same cohort
[19]. While implementations of HLA clustering in the context of HIV-1 have been developed based on a
Bayesian framework [13] these were limited by the diversity of their data and assumptions of the clustering
parameters. Extra considerations must be taken when clustering predicted immunopeptidomes to avoid bias
introduced by predominantly low binding affinity predictions, which can affect the reliability of the computed
distances between HLA alleles. To solve this, we used consensus clustering by computing different subsets of
immunopeptidomes and avoiding bias from non-binders when calculating distances between HLA class I
alleles under multiple linkage functions. The ensemble of implemented techniques avoids bias due to high
dimensionality and has shown success when applied to clustering of biological data [22].
We propose diverse methodological improvements in the analysis of host-viral genetics. From the host
genetics perspective, we showed that imputed HLA class I alleles can be used to calculate HLA diversity
without requiring full HLA sequences. From the viral genetics perspective, we incorporated viral sequences
using a fast k-mer approach to avoid generating a consensus sequence, especially for pathogens of high
genetic variability and to take into account the intra-host viral diversity [32]. When combining host and viral
information through predicted binding affinities, we showed that the differences between clustering based
on specific HIV and random immunopeptidomes facilitate a slightly higher resolution of the trees. We suggest
that the unspecific HLA clusters from the latter approach could be used for other infectious phenotypes. This
is likely due to the diversity of viral genomes present in our dataset, and this may not be the case for smaller,
more homogeneous cohorts. However, it suggests that our immunopeptidomes could work as a proxy for
other similarly diverse studies that do not have access to paired viral genomes. Alternatively, new
immunopeptidomes may be proposed based entirely on synthetic data at the risk of increased type-II error.
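The k-mer route from reads to candidate 9-mer peptides mentioned in this paragraph and in the Methods can be illustrated in plain Python. The study used KAT for k-mer counting; the sketch below only mimics the idea (27-mers seen more than once, translated in frame) and makes assumptions that are not in the text: only the forward strand is considered, and k-mers containing ambiguous bases or stop codons are dropped. All names are hypothetical.

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}

def reads_to_peptides(reads, k=27, min_count=2):
    """Translate every k-mer observed at least `min_count` times into a peptide."""
    counts = Counter(r[i:i + k] for r in reads for i in range(len(r) - k + 1))
    peptides = set()
    for kmer, n in counts.items():
        if n < min_count or set(kmer) - set("ACGT"):
            continue  # singleton k-mer or ambiguous base
        peptide = "".join(CODON_TABLE[kmer[j:j + 3]] for j in range(0, k, 3))
        if "*" not in peptide:  # skip peptides interrupted by a stop codon (assumption)
            peptides.add(peptide)
    return peptides

# Two identical toy reads of 27 nucleotides (ATG followed by eight GCT codons)
read = "ATG" + "GCT" * 8
peptides = reads_to_peptides([read, read])
```

Since every offset of a read yields its own 27-mer, all three forward reading frames are covered without assembling a consensus sequence.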
While we demonstrate the utility of the implemented methodology on HIV, there is also a clear road to
extend it to the analysis of other pathogens eliciting class-I immune responses. Using this common functional
framework, there is scope to analyze the variation of HLA structure in populations and within the context of
multiple pathogens. Another possible application could be as a screening tool for the most effective peptides
either for targeting populations carrying specific allelic distributions or maximizing coverage of vaccines
across geographies. Finally, we propose that the focused consideration of HLA function in terms of peptide
binding affinities provides a promising approach to inform modern vaccine design. This would also be
relevant in the advent of mRNA vaccines coming to fruition after decades of research [33] as they take
advantage of proper peptide selection. As an example, the need for a new class of tools that account for both
variable immunogenic coverage and clinically relevant mutations has been highlighted during the recent
SARS-CoV-2 pandemic [34]. To facilitate open research, the results of this work are made available for
common use via a web application
(https://persimune-health-informatics.shinyapps.io/PAW2022Zucco__HLA_HIV_INSIGHT/).
Acknowledgements
We would like to thank all participants in the START trial and all trial investigators. See N Engl J Med 2015;
373:795–807 [16] for the complete list of START investigators.
Author contributions
A.G.Z., J.D.L. and C.R.M conceived the study. A.G.Z., M.B, M.G. and C.E prepared the data. A.G.Z performed
the statistical and computational analyses. A.G.Z. and C.R.M drafted the manuscript. All authors contributed
to data interpretation, critically revised the manuscript, and approved the final version.
Conflict of interest
There are no conflicts of interest to disclose.
Sources of funding
This study was supported by the Danish National Research Foundation (DNRF126) and the National Institute
of Allergy and Infectious Diseases, Division of Clinical Research and Division of AIDS (National Institutes of
Health grants UM1-AI068641, UM1-AI120197 and U01-AI136780). The START trial was supported by the
National Institute of Allergy and Infectious Diseases, National Institutes of Health Clinical Center, National
Cancer Institute, National Heart, Lung, and Blood Institute, Eunice Kennedy Shriver National Institute of Child
Health and Human Development, National Institute of Mental Health, National Institute of Neurological
Disorders and Stroke, National Institute of Arthritis and Musculoskeletal and Skin Diseases, Agence Nationale
de Recherches sur le SIDA et les Hépatites Virales (France), National Health and Medical Research Council
(Australia), National Research Foundation (Denmark), Bundes Ministerium für Bildung und Forschung
(Germany), European AIDS Treatment Network, Medical Research Council (United Kingdom), National
Institute for Health Research, National Health Service (United Kingdom), and the University of Minnesota.
Antiretroviral drugs were donated to the central drug repository by AbbVie, Bristol-Myers Squibb, Gilead
Sciences, GlaxoSmithKline/ViiV Healthcare, Janssen Scientific Affairs, and Merck.
References
[1] C. Tian et al., “Genome-wide association and HLA region fine-mapping studies identify susceptibility
loci for multiple common infections,” Nature Communications, vol. 8, no. 1, p. 599, Sep. 2017, doi:
10.1038/s41467-017-00257-5.
[2] C. Ekenberg et al., “Association Between Single-Nucleotide Polymorphisms in HLA Alleles and Human
Immunodeficiency Virus Type 1 Viral Load in Demographically Diverse, Antiretroviral Therapy–Naive
Participants From the Strategic Timing of AntiRetroviral Treatment Trial,” J Infect Dis, vol. 220, no. 8,
pp. 1325–1334, Sep. 2019, doi: 10.1093/infdis/jiz294.
[3] I. Bartha, P. J. McLaren, C. Brumme, R. Harrigan, A. Telenti, and J. Fellay, “Estimating the Respective
Contributions of Human and Viral Genetic Variation to HIV Control,” PLOS Computational Biology, vol.
13, no. 2, p. e1005339, Feb. 2017, doi: 10.1371/journal.pcbi.1005339.
[4] C. Fraser et al., “Virulence and Pathogenesis of HIV-1 Infection: An Evolutionary Perspective,” Science,
vol. 343, no. 6177, Mar. 2014, doi: 10.1126/science.1243727.
[5] P. J. McLaren and M. Carrington, “The impact of host genetic variation on infection with HIV-1,” Nature
Immunology, vol. 16, no. 6, pp. 577–583, Jun. 2015, doi: 10.1038/ni.3147.
[6] I. Bartha et al., “A genome-to-genome analysis of associations between human genetic variation, HIV-
1 sequence diversity, and viral control,” eLife Sciences, vol. 2, p. e01123, Oct. 2013, doi:
10.7554/eLife.01123.
[7] J. Arora, P. J. McLaren, N. Chaturvedi, M. Carrington, J. Fellay, and T. L. Lenz, “HIV peptidome-wide
association study reveals patient-specific epitope repertoires associated with HIV control,” PNAS, vol.
116, no. 3, pp. 944–949, Jan. 2019, doi: 10.1073/pnas.1812548116.
[8] A. Aflalo and L. H. Boyle, “Polymorphisms in MHC class I molecules influence their interactions with
components of the antigen processing and presentation pathway,” International Journal of
Immunogenetics, vol. n/a, no. n/a, doi: 10.1111/iji.12546.
[9] S. Buhler, J. M. Nunes, and A. Sanchez-Mazas, “HLA class I molecular variation and peptide-binding
properties suggest a model of joint divergent asymmetric selection,” Immunogenetics, vol. 68, no. 6,
pp. 401–416, Jul. 2016, doi: 10.1007/s00251-016-0918-x.
[10] J. M. Carlson, A. Q. Le, A. Shahid, and Z. L. Brumme, “HIV-1 adaptation to HLA: a window into virus–
host immune interactions,” Trends in Microbiology, vol. 23, no. 4, pp. 212–224, Apr. 2015, doi:
10.1016/j.tim.2014.12.008.
[11] A. Sette and J. Sidney, “HLA supertypes and supermotifs: a functional perspective on HLA
polymorphism,” Current Opinion in Immunology, vol. 10, no. 4, pp. 478–482, Aug. 1998, doi:
10.1016/S0952-7915(98)80124-6.
[12] M. Thomsen, C. Lundegaard, S. Buus, O. Lund, and M. Nielsen, “MHCcluster, a method for functional
clustering of MHC molecules,” Immunogenetics, vol. 65, no. 9, pp. 655–665, Sep. 2013, doi:
10.1007/s00251-013-0714-9.
[13] C. van Dorp and C. Kesmir, “Estimating HLA disease associations using similarity trees,” bioRxiv, p.
408302, Sep. 2018, doi: 10.1101/408302.
[14] A. E. Kennedy, U. Ozbek, and M. T. Dorak, “What has GWAS done for HLA and disease associations?,”
International Journal of Immunogenetics, vol. 44, no. 5, pp. 195–211, Oct. 2017, doi: 10.1111/iji.12332.
[15] V. Jurtz, S. Paul, M. Andreatta, P. Marcatili, B. Peters, and M. Nielsen, “NetMHCpan-4.0: Improved
Peptide–MHC Class I Interaction Predictions Integrating Eluted Ligand and Peptide Binding Affinity
Data,” The Journal of Immunology, vol. 199, no. 9, pp. 3360–3368, Nov. 2017, doi:
10.4049/jimmunol.1700893.
[16] Insight Start Study Group, “Initiation of Antiretroviral Therapy in Early Asymptomatic HIV Infection,”
New England Journal of Medicine, vol. 373, no. 9, pp. 795–807, Aug. 2015, doi:
10.1056/NEJMoa1506816.
[17] X. Zheng et al., “HIBAG—HLA genotype imputation with attribute bagging,” The Pharmacogenomics
Journal, vol. 14, no. 2, pp. 192–200, Apr. 2014, doi: 10.1038/tpj.2013.18.
[18] C. Ekenberg et al., “The association of human leukocyte antigen alleles with clinical disease progression
in HIV-positive cohorts with varied treatment strategies,” AIDS, vol. 35, no. 5, pp. 783–789, Apr. 2021,
doi: 10.1097/QAD.0000000000002800.
[19] M. Gabrielaite et al., “Human immunotypes impose selection on viral genotypes through viral epitope
specificity,” The Journal of Infectious Diseases, no. jiab253, May 2021, doi: 10.1093/infdis/jiab253.
[20] D. Mapleson, G. Garcia Accinelli, G. Kettleborough, J. Wright, and B. J. Clavijo, “KAT: a K-mer analysis
toolkit to quality control NGS datasets and genome assemblies,” Bioinformatics, vol. 33, no. 4, pp. 574–576,
Feb. 2017, doi: 10.1093/bioinformatics/btw663.
[21] S. van Dongen and A. J. Enright, “Metric distances derived from cosine similarity and Pearson and
Spearman correlations,” arXiv:1208.3145 [cs, stat], Aug. 2012, Accessed: Jan. 11, 2019. [Online].
Available: http://arxiv.org/abs/1208.3145
[22] T. Ronan, Z. Qi, and K. M. Naegle, “Avoiding common pitfalls when clustering biological data,” Sci.
Signal., vol. 9, no. 432, pp. re6–re6, Jun. 2016, doi: 10.1126/scisignal.aad1932.
[23] A. Strehl and J. Ghosh, “Cluster Ensembles — a Knowledge Reuse Framework for Combining Multiple
Partitions,” J. Mach. Learn. Res., vol. 3, pp. 583–617, Mar. 2003, doi: 10.1162/153244303321897735.
[24] P. Virtanen et al., “SciPy 1.0–Fundamental Algorithms for Scientific Computing in Python,” arXiv e-
prints, p. arXiv:1907.10121, Jul. 2019.
[25] R Core Team, R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation
for Statistical Computing, 2019. [Online]. Available: https://www.R-project.org/
[26] I. Letunic and P. Bork, “Interactive Tree Of Life (iTOL) v4: recent updates and new developments,”
Nucleic Acids Res, vol. 47, no. W1, pp. W256–W259, Jul. 2019, doi: 10.1093/nar/gkz239.
[27] T. Galili, “dendextend: an R package for visualizing, adjusting and comparing trees of hierarchical
clustering,” Bioinformatics, vol. 31, no. 22, pp. 3718–3720, Nov. 2015, doi:
10.1093/bioinformatics/btv428.
[28] P. J. R. Goulder and B. D. Walker, “HIV and HLA Class I: An Evolving Relationship,” Immunity, vol. 37, no.
3, pp. 426–440, Sep. 2012, doi: 10.1016/j.immuni.2012.09.005.
[29] H. Valenzuela-Ponce et al., “Novel HLA class I associations with HIV-1 control in a unique genetically
admixed population,” Scientific Reports, vol. 8, no. 1, p. 6111, Apr. 2018, doi: 10.1038/s41598-018-
23849-7.
[30] X. Zhang et al., “HLA-B*44 Is Associated with a Lower Viral Set Point and Slow CD4 Decline in a Cohort
of Chinese Homosexual Men Acutely Infected with HIV-1,” Clin Vaccine Immunol, vol. 20, no. 7, pp.
1048–1054, Jul. 2013, doi: 10.1128/CVI.00015-13.
[31] B. J. Debebe et al., “Identifying the immune interactions underlying HLA class I disease associations,”
eLife, vol. 9, p. e54558, Apr. 2020, doi: 10.7554/eLife.54558.
[32] J. Fellay and V. Pedergnana, “Exploring the interactions between the human and viral genomes,” Hum
Genet, Nov. 2019, doi: 10.1007/s00439-019-02089-3.
[33] N. A. C. Jackson, K. E. Kester, D. Casimiro, S. Gurunathan, and F. DeRosa, “The promise of mRNA
vaccines: a biotech and industrial perspective,” npj Vaccines, vol. 5, no. 1, Art. no. 1, Feb. 2020, doi:
10.1038/s41541-020-0159-8.
[34] G. Liu, B. Carter, and D. K. Gifford, “Predicted Cellular Immunity Population Coverage Gaps for SARS-
CoV-2 Subunit Vaccines and Their Augmentation by Compact Peptide Sets,” Cell Systems, vol. 12, no. 1,
pp. 102-107.e4, Jan. 2021, doi: 10.1016/j.cels.2020.11.010.
Table 1. Demographic characteristics of the START cohort at study entry.

Characteristics                                   Genotyped participants (n=2,546)
Median age (IQR) (years)                          36 (29–45)
Sex [n (%)]
  Female                                          511 (20.1)
  Male                                            2035 (79.9)
Race/ethnic group [n (%)]
  Black                                           577 (22.7)
  Hispanic                                        498 (19.6)
  Asian                                           26 (1.0)
  White                                           1404 (55.2)
  Other                                           41 (1.6)
Geographical region [n (%)]
  Africa                                          343 (13.5)
  Asia                                            0 (0)
  Australia and New Zealand                       96 (3.8)
  Europe and Israel                               1148 (45.1)
  Latin America                                   499 (19.6)
  United States and Canada                        460 (18.1)
Mode of HIV-infection [n (%)]
  Sexual contact
    MSM                                           1633 (64.1)
    With a person of the opposite sex             751 (29.5)
  Injection-drug use                              45 (1.8)
  Other                                           117 (4.6)
Median time since HIV diagnosis (IQR) (years)     1 (0–3)
ART-naïve [n (%)]                                 2546 (100)
Median CD4+ T-cell count (IQR) (cells/μL)         651 (585–759)
Median HIV viral load (IQR) (copies/ml)           14 833 (3503–46 000)

IQR, interquartile range; MSM, men who have sex with men; ART, Antiretroviral treatment.
Figure 1. Web application demonstrating a subset of available data visualizations.
Panels (a) and (b) illustrate the global diversity of imputed HLA alleles in terms of Shannon and Simpson indices,
respectively. Darker (navy) colors indicate higher diversity. Panel (c) showcases the interactive component
used to further examine specific HLA allele, HIV subtype, and/or peptide frequencies for the selected cohort,
with options to explore individual countries in detail.
Figure 2. Correlations between HIV viral load and HLA diversity per country for three HLA loci.
Mean log10(HIV-VL) was depicted (y-axis) as dot plots against HLA diversity measured in terms of Simpson
index (x-axis) for three HLA class I loci (HLA-A, -B, -C) in 21 countries. The noted R corresponds to a Pearson
correlation with its corresponding p-value.
Figure 3. Dendrogram of 268 HLA class I alleles based on consensus clustering of predicted binding
affinities to HIV peptides.
Predicted binding affinities to 173,792 HIV peptides were used to calculate the HLA allele distances used for
consensus clustering and represented as a dendrogram through hierarchical clustering. Associations with
log10(HIV-VL) of each node (HLA functional node) and leaves (HLA alleles) in the dendrogram were tested and
adjusted by sex, self-reported race, and country. Associations were defined by an adjusted p-value
(Benjamini-Hochberg) < 0.05 and are represented as thick branches for nodes and black triangles for leaves.
White triangles indicate HLA alleles detected in our cohort. The effect of the respective associations is color-
coded from protective effect (blue) to detrimental (red). On the outer ring, HLA allele counts are depicted as
green bars. An interactive version of the tree can be found in the provided web application.















Supplementary figures
Associations of functional HLA class I groups with HIV viral load in a
heterogeneous cohort.
Adrian G. ZUCCO1, Marc BENNEDBÆK2, Christina EKENBERG1, Migle GABRIELAITE3, Preston LEUNG1, Mark N.
POLIZZOTTO4, Virginia KAN5, Daniel D. MURRAY1, Jens D. LUNDGREN1 and Cameron R. MACPHERSON1 for the
INSIGHT START study group.
1PERSIMUNE Center of Excellence, Rigshospitalet, Copenhagen, Denmark
2Virus Research and Development Laboratory, Virus and Microbiological Special Diagnostics, Statens Serum
Institut, Copenhagen, Denmark
3Center for Genomic Medicine, Copenhagen University Hospital, Copenhagen, Denmark.
4Clinical Hub for Interventional Research, College of Health and Medicine, The Australian National University,
Canberra, Australia
5George Washington University, Veterans Affairs Medical Center, Washington D.C, U.S.A.
Corresponding author
Adrian Gabriel Zucco, MSc, PhD student
Tel: +45 35 45 57 75
mail: adrian.gabriel.zucco@regionh.dk
Rigshospitalet, Copenhagen University Hospital
Centre of Excellence for Health, Immunity and Infections (CHIP) & PERSIMUNE
Blegdamsvej 9, DK-2100 Copenhagen Ø, Denmark
Figure S1. Tanglegram of consensus clustering from predicted immunopeptidomes based on
random versus HIV-specific peptides.
Two different peptide sets were used for consensus clustering based on predicted immunopeptidomes of 268
HLA class I alleles. The dendrogram generated from 5×10^5 random peptides (left) is compared to the
dendrogram generated from 173,792 HIV peptides (right). Black branches and lines connecting both dendrograms
indicate differences in clustering between the two dendrograms.
Figure S2. Q-Q plot of observed versus theoretical q-values from associations to HIV-VL from HLA
functional nodes.
HLA functional nodes were generated by consensus clustering of predicted HIV-specific immunopeptidomes
(red) and non-specific predicted immunopeptidomes from random peptides (blue).
 

Manuscript of study II
Personalized survival probabilities for SARS-CoV-2 positive patients by
explainable machine learning.2
Adrian G. Zucco, Rudi Agius, Rebecka Svanberg, Kasper S. Moestrup, Ramtin Z. Marandi,
Cameron Ross MacPherson, Jens Lundgren, Sisse R. Ostrowski*, Carsten U. Niemann*
*Co-senior authors.
medRxiv, Oct. 2021. (Accepted at Scientific Reports)
Personalized survival probabilities for SARS-CoV-2 positive patients
by explainable machine learning
Adrian G. Zucco1, Rudi Agius2, Rebecka Svanberg2, Kasper S. Moestrup1, Ramtin Z. Marandi1, Cameron Ross
MacPherson1, Jens Lundgren1,4, Sisse R. Ostrowski3,4*, Carsten U. Niemann2,4*
1PERSIMUNE Center of Excellence, Rigshospitalet, Copenhagen, Denmark.
2Department of Hematology, Rigshospitalet, Copenhagen, Denmark.
3Department of Clinical Immunology, Rigshospitalet, Copenhagen, Denmark.
4Department of Clinical Medicine, University of Copenhagen, Denmark.
*Co-senior authors.
Correspondence should be addressed to: A.G.Z (adrian.gabriel.zucco@regionh.dk), S.R.O
(Sisse.Rye.Ostrowski@regionh.dk) or C.U.N (Carsten.Utoft.Niemann@regionh.dk).
ABSTRACT
Interpretable risk assessment of SARS-CoV-2 positive patients can aid clinicians to implement precision
medicine. Here we trained a machine learning model to predict mortality within 12 weeks of a first positive
SARS-CoV-2 test. By leveraging data on 33,938 confirmed SARS-CoV-2 cases in eastern Denmark, we
considered 2,723 variables extracted from electronic health records (EHR) including demographics, diagnoses,
medications, laboratory test results and vital parameters. A discrete-time framework for survival modelling
enabled us to predict personalized survival curves and explain individual risk factors. Performance on the test
set was measured with a weighted concordance index of 0.95 and an area under the curve for precision-recall
of 0.71. Age, sex, number of medications, previous hospitalizations and lymphocyte counts were identified as
top mortality risk factors. Our explainable survival model developed on EHR data also revealed temporal
dynamics of the 22 selected risk factors. Upon further validation, this model may allow direct reporting of
personalized survival probabilities in routine care.
INTRODUCTION
By April 2022 the Coronavirus disease 2019 (COVID-19) had claimed over 6 million lives since its outbreak
in late 20191. COVID-19 is caused by Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) and
infected individuals present with a variety of symptoms ranging from asymptomatic to life-threatening2.
Although the majority of SARS-CoV-2 positive cases experience mild to moderate disease, approximately 15%
are estimated to develop severe disease3. Progression to severe disease occurs within 1-2 weeks from symptom
onset and is characterized by clinical signs of pneumonia with dyspnea, increased respiratory rate, and decreased
blood oxygen saturation requiring supplemental oxygen3–7. Development of critical illness is driven by systemic
inflammation, leading to acute respiratory distress syndrome (ARDS), respiratory failure, septic shock, multi-
organ failure, and/or disseminated coagulopathy4,5,8. The majority of these patients require mechanical
ventilation, and mortality for patients admitted to an Intensive Care Unit (ICU) is reported to be 32-50%3,8–10.
Despite the current vaccination programs, both vaccinated and unvaccinated individuals
continue to develop critical COVID-19 disease11. Thus, the pandemic remains a burden on health care systems
worldwide, locally approaching the limit of capacity due to high patient burden and challenging clinical
management.
Several factors have been associated with increased risk of severe disease including old age, male gender, and
lifestyle factors such as smoking and obesity12,13. Comorbidities including hypertension, type 2 diabetes, renal
disease, as well as pre-existing conditions of immune dysfunction and cancer, are also associated with a higher
risk of severe disease and COVID-19 related death12,14–16. Among hospitalized patients, risk factors for severe
disease or death include low lymphocyte counts, elevated inflammatory markers and elevated kidney and liver
parameters indicating organ dysfunction6. However, many of these factors likely reflect an ongoing progression
of COVID-19. Identification of high-risk patients is thus warranted at, or prior to, hospital admission to facilitate
personalized interventions.
Multiple COVID-19 prognostic models have been built using traditional statistics frameworks or machine
learning (ML) algorithms. These models have been focused on reduced sets of predictive features from
demographics, patient history, physical examination, and laboratory results17. A systematic review of 50
prognostic models has concluded that such models were poorly reported and are at a high risk of bias18. While
great effort has been put into providing prognostic models based on data collected from health systems,
traditional modelling approaches based solely on domain knowledge risk missing
novel markers and insights about the disease that could come from data-driven models built in a hypothesis-free
manner19, which have been reported to outperform models based on variables curated by domain experts20.
Furthermore, ML models facilitate clinical insights21 when coupled with methods for model explainability such
as SHapley Additive exPlanations (SHAP) values22. Model explainability has been developed mainly in the
context of regression and binary classification, but in clinical research where censored observations are
common, explainable time-to-event modelling is required to avoid selection bias23,24. Multiple ML algorithms
have been developed for time-to-event modelling, either by building on top of existing models such as Cox
proportional hazards or by defining new loss functions that model time as continuous25. Here we used an
approach that considered time in discrete intervals and performed binary classification at such time intervals26.
This allowed implementation of gradient boosting decision trees for binary classification to simultaneously
predict personalized survival probabilities27 and allow explainability at the individual patient level using SHAP
values22 including temporal dynamics of risk factors over the course of the disease. This approach provides a
framework for precision medicine that can be applied to other diseases based on routine electronic health
records.
RESULTS
Patient cohort
Based on centralized EHR and SARS-CoV-2 test results from test centres in eastern Denmark, we identified
33,938 patients who had at least one SARS-CoV-2 RT-PCR positive test from 963,265 individuals who had a
test performed between 17th of March 2020 and 2nd of March 2021 (Fig. 1). Of these patients, 5,077 were
hospitalized and 502 were admitted to the ICU (Supplementary Fig. 1). Overall, 1,803 (5.34%) deaths occurred
among all individuals with a positive SARS-CoV-2 RT-PCR test, of whom 141 died later than 12 weeks after
the first positive test (FPT) and were hence considered alive for this analysis. Right-censoring was only observed for
patients tested after the 8th of December 2020, who had less than 12 weeks of follow-up available, while deaths that
occurred on the same day as the FPT were not considered for training. For the initial model, demographics, laboratory
test results, hospitalizations, vital parameters, diagnoses, medicines (ordered and administered) and summary
features were included. Feature encoding resulted in 2,723 features (Supplementary Table 2) which after feature
selection were reduced to 22 features. A summary of the cohort based on the final feature set can be found in
Table 1. This cohort represents an updated subset of individuals residing in Denmark characterized in a previous
publication28.
Survival modelling with machine learning achieves high discriminative performance
To predict the risk of death within 12 weeks from FPT, we trained gradient boosting decision trees considering
time as discrete in a time-to-event framework. Performance was measured on 20% of the data (test set)
unblinded only for performance assessment. The weighted concordance index (C-index) for predicting the risk
of death for all 12 weeks with 95% confidence intervals (CI) was 0.946 (0.941-0.950). Binary metrics were
calculated for each predicted week by excluding censored individuals (Fig. 2). At week 12, the precision-recall
area under the curve (PR-AUC) and Matthews correlation coefficient (MCC) with 95% CI were 0.686 (0.651-
0.720) and 0.580 (0.562-0.597) respectively. The sensitivity was 99.3% and the specificity was 86.4%. The
performance for subgroups of patients displayed some differences. For patients tested outside the hospital (Fig.
2b), the C-index was 0.955 (0.950-0.960), and the PR-AUC and MCC were 0.675 (0.632-0.719) and 0.585 (0.562-
0.605) respectively. A sensitivity of 98.9% and a specificity of 89.9% were measured in this group. For patients
already admitted to the hospital at the time of test (Fig. 2c), the C-index was 0.809 (0.787-0.829), and the PR-
AUC and MCC were 0.705 (0.640-0.760) and 0.357 (0.325-0.387) respectively. The sensitivity was 100% and
the specificity 31.0%, indicating a higher number of false positives when using a 0.5 probability threshold for
this group (Supplementary Table 1). Predictions could be computed in the presence of missing values with only
minimal reductions in performance observed (Supplementary Table 4).
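As an illustration of how week-specific binary metrics of this kind can be derived from predicted death probabilities, the following minimal Python sketch computes sensitivity, specificity and MCC at a 0.5 probability threshold. The function name, the toy labels and probabilities, and the choice to count a probability equal to the threshold as a positive prediction are assumptions made for illustration and are not taken from the study code.

```python
import math


def binary_metrics(y_true, y_prob, threshold=0.5):
    """Return sensitivity, specificity and MCC at a probability threshold.

    y_true: observed outcomes (1 = died, 0 = alive); censored individuals
            are assumed to have been excluded beforehand.
    y_prob: predicted death probabilities for the same individuals.
    """
    tp = fp = tn = fn = 0
    for truth, prob in zip(y_true, y_prob):
        predicted = 1 if prob >= threshold else 0  # assumption: ties count as positive
        if predicted == 1 and truth == 1:
            tp += 1
        elif predicted == 1 and truth == 0:
            fp += 1
        elif predicted == 0 and truth == 0:
            tn += 1
        else:
            fn += 1
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when any marginal count is zero.
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return {"sensitivity": sensitivity, "specificity": specificity, "mcc": mcc}


# Hypothetical toy data: two deaths and four survivors.
y_true = [1, 1, 0, 0, 0, 0]
y_prob = [0.9, 0.6, 0.7, 0.2, 0.1, 0.4]
metrics = binary_metrics(y_true, y_prob)
```

In this toy example every death is flagged (high sensitivity) at the cost of one false positive, the same trade-off described above for the hospitalized subgroup at the 0.5 threshold.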
Predicted individual survival distributions represent patient heterogeneity
Individual survival distributions were predicted for patients in the test set. The median of the predicted
cumulative death probabilities by survival status reflected the discriminative performance of the individual
survival predictions (Fig. 3a). Deceased patients exhibited a risk of mortality that increased for the first month
after FPT. Patients who died 2 months after the FPT exhibited a higher instant risk of death at these later periods
than those patients who died earlier (Fig 3b). Our survival modelling approach is also able to approximate the
time of death within the 12-week time window, as highlighted by the predicted discrete (Fig 3c) and cumulative
death probabilities (Fig 3d) for three individual patients. Early death was observed as a steep increase in death
probability in the first weeks while late death was observed as a gradual increase in cumulative death probability
(Fig 3c). Our modelling approach also considered censored patients, for whom death probabilities were predicted
for all periods even after censoring (Fig 3c-d).
Local and global model explainability reveal temporal dynamics of mortality risk factors
Feature selection for the final model was data-driven using 5-fold cross-validation on the training set. From the
original set of 2,723 features generated from routine EHR data (Supplementary Table 2), 22 features were
selected. This selection was based on feature importance filtering determined by the mean absolute SHAP
values. Among top features, basic characteristics such as age, BMI, and sex, as well as clinical factors such as
the number of unique prescribed medications and diagnosis codes were represented (Fig. 4a). Moreover,
hospitalization at the time of FPT was identified to impact the risk of death. This is further emphasized by the
different performances of the model when restricted to this sub-cohort (Supplementary Table 1). We also
identified the week during the pandemic in which the FPT was taken as having an impact on the risk of death.
Furthermore, the risk of death was higher within the first four weeks after FPT, as encoded by the feature for
the week of prediction. The model also allowed us to explore the temporal dynamics of individual risk factors
across the 12-week prediction window (Fig 4b). Features such as age, ordered loop diuretics, and admission at
the time of FPT had a higher impact on the risk of dying early, while BMI, diagnosis of Alzheimer’s disease,
and ordered B-vitamin contributed more to late risk. Thus, the identification of such time dependency for
features at the individual patient level further reveals different risk factors acting on different time-horizons for
the predicted risk of individual patients (Fig 4c-d).
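The two summaries used in this section, a global ranking by mean absolute SHAP value and a per-week view of a feature's contribution, can be sketched in a few lines of Python. The sketch assumes the SHAP values are already available as one row per patient-week; the function names, feature names and numbers are hypothetical, and computing the SHAP values themselves is outside the scope of the example.

```python
def rank_features_by_mean_abs_shap(shap_matrix, feature_names, top_k):
    """Rank features by mean absolute SHAP value over all patient-week rows."""
    n_rows = len(shap_matrix)
    scores = []
    for j, name in enumerate(feature_names):
        mean_abs = sum(abs(row[j]) for row in shap_matrix) / n_rows
        scores.append((name, mean_abs))
    scores.sort(key=lambda item: item[1], reverse=True)
    return scores[:top_k]


def mean_abs_shap_by_week(shap_matrix, weeks, feature_index):
    """Mean absolute SHAP value of one feature, grouped by week of prediction."""
    grouped = {}
    for row, week in zip(shap_matrix, weeks):
        grouped.setdefault(week, []).append(abs(row[feature_index]))
    return {week: sum(vals) / len(vals) for week, vals in sorted(grouped.items())}


# Hypothetical SHAP values: one row per patient-week, one column per feature.
feature_names = ["age", "bmi", "sex"]
shap_matrix = [
    [0.8, -0.1, 0.2],   # patient A, week 1
    [-0.6, 0.3, 0.1],   # patient B, week 1
    [0.2, -0.5, 0.0],   # patient A, week 12
    [0.4, 0.7, -0.1],   # patient B, week 12
]
weeks = [1, 1, 12, 12]

top_features = rank_features_by_mean_abs_shap(shap_matrix, feature_names, top_k=2)
bmi_by_week = mean_abs_shap_by_week(shap_matrix, weeks, feature_names.index("bmi"))
```

In the toy data the BMI column has larger absolute contributions at week 12 than at week 1, mimicking a feature that matters more for late than for early risk.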
Machine learning captures non-linear patterns of mortality risk factors
Partial dependence plots (PDP) showed that the model learned non-linear contributions to the risk of mortality.
We found that age contributed to the risk of death in patients over 60 years of age (Fig 5a). BMI seemed to explain a higher
risk of mortality in patients with BMI lower than 30 (Fig 5b), and males presented a higher risk of mortality
than females (Fig 5c). A higher risk of death was also seen for patients with low lymphocyte count (Fig 5d).
As expected, patients with more hospitalizations and longer cumulative admission days prior to FPT exhibited
a higher risk of death (Fig 5e-f). Similarly, the previously mentioned contribution of being admitted to the
hospital at the time of the FPT to the risk of death was observed (Fig 5g). We found that the number of ordered
medicines was a better predictor of death than the number of diagnoses, showing non-linear patterns where
patients with fewer than five ordered medications in the last year showed up to 10% lower risk of death whereas
some patients with more than 20 ordered medications had up to 40% higher risk of death (Fig. 5h).
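For readers unfamiliar with partial dependence, the classic computation can be sketched as follows: one feature is forced to each value on a grid for every patient, and the model predictions are averaged. The sketch uses a hand-written toy risk function in place of a trained model; the function names, coefficients and patients are hypothetical, and the plots in this study may have been produced with a different implementation.

```python
def partial_dependence(predict, rows, feature, grid):
    """Average prediction when `feature` is set to each grid value for all rows."""
    curve = []
    for value in grid:
        modified = [{**row, feature: value} for row in rows]
        predictions = [predict(row) for row in modified]
        curve.append(sum(predictions) / len(predictions))
    return curve


def toy_risk(row):
    """Hypothetical stand-in for a trained model: risk rises only above age 60."""
    age_term = max(0, row["age"] - 60) * 0.01
    sex_term = 0.05 if row["sex"] == "male" else 0.0
    return min(1.0, 0.02 + age_term + sex_term)


# Hypothetical patients.
rows = [{"age": 45, "sex": "female"}, {"age": 70, "sex": "male"}]
age_curve = partial_dependence(toy_risk, rows, "age", grid=[40, 60, 80])
```

The resulting curve is flat up to 60 years and rises afterwards, which is the kind of non-linear pattern a partial dependence plot is meant to expose.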
Interactions between mortality risk factors reveal clusters of features
To unravel interactions between risk factors, we explored the interdependence of the selected features by their
SHAP interactions (Fig. 6). The interaction map for patients who died within 4 weeks from FPT revealed that
the week of prediction and age interacted with several other features including previous hospital admissions and
prescriptions of several drugs for at least 80% of patients (Fig. 6a). Thus, the information provided by these
specific variables combined seems of particular importance for predicting early death (< 4 weeks). For patients
who died after 8 weeks post-FPT, different interaction clusters emerged in which age, number of ordered
medicines, BMI, and vitamin supplements like B-vitamins and calcium with vitamin D interacted in more than
70% of the patients. Also, lymphocyte count and admission at the time of FPT interacted with the number of
medications in at least 60% of the patients (Fig. 6b).
DISCUSSION
We here developed an explainable Machine Learning model for predicting the risk of death within the first 12
weeks from a positive SARS-CoV-2 PCR test. By implementing a discrete-time modelling approach we
computed personalized survival probabilities, explained individual risk factors and achieved high discriminative
performance in terms of C-index (0.946 CI 95%: 0.941-0.950) and PR-AUC (0.686 CI 95%: 0.651-0.720). From
a methodological perspective, we demonstrate how discrete-time modelling provides a framework that allows
the use of existing classification algorithms for survival modelling of EHR while enabling model explainability.
Compared to traditional approaches such as Cox Proportional Hazards, we could model non-linear effects, relax
assumptions, learn interactions between variables and explain temporal dynamics of risk factors without
compromising discriminative performance. This has implications not only for model development, by
providing a flexible framework that can be applied to new medical problems, but also for showcasing patient-specific
risk factors and temporal changes in SARS-CoV-2 positive individuals.
During the COVID-19 pandemic, attempts have been made to provide prognostic models by implementing
diverse modelling approaches. This has resulted in publications using statistical and Machine Learning (ML)
approaches to predict the diagnosis or prognosis of COVID-19 related outcomes. Meta-analyses have indicated
that the majority of published models suffer from a risk of bias due to overfitting, small sample sizes, poor
cohort definition or not considering censored patients18,29. To overcome some of these previous limitations, we
used electronic health records (EHR) from eastern Denmark, identifying 33,938 patients who had at least one
positive SARS-CoV-2 RT-PCR test. To enable ML algorithms, clinical data need to be encoded into features
that can be computed. Multiple approaches have been suggested for encoding EHR into computationally
meaningful representations30,31 and to represent temporal and uncertain variables32,33. We opted for a simple
feature engineering approach by considering the latest values or counts in clinically relevant time windows prior
to FPT depending on the type of variable. Additionally, instead of characterizing patients' relevant history using
a limited set of pre-selected variables, the set of 22 features in the final model was derived using a data-driven
approach from an initial set of 2,723 features that encoded available demographics, laboratory test results,
hospitalizations, vital parameters, diagnoses and medicines. This approach enabled us to reduce model
complexity to a smaller feature set while avoiding potential bias introduced by pre-selecting variables. While
EHR are more representative of patient populations in terms of real-world data (RWD)34, some challenges arise
when processing EHR for clinical research. Data collected from routine care may present inconsistencies35 that
cannot be appropriately curated in such big data sets, especially for information regarding clinical
interventions or hospitalization status. We thus selected SARS-CoV-2 positive status and mortality for patient
selection and outcome, respectively, based on robustness to bias from clinical management. Characteristics of
these variables have been previously defined in a Danish nationwide cohort28 from 20th of February 2020 until
19th of May 2020 in alignment with our subset of patients in eastern Denmark.
More importantly, handling time in ML is not only relevant for encoding features but also for the modelling
framework of choice. When handling longitudinal data, time is usually fixed for a specific period and ML
algorithms for binary classification are applied. To do so, patients for whom the event of interest was not
observed before they were lost to follow-up (censored) are excluded, resulting in an underestimation of predicted
risks23,24. This has been the predominant modelling approach for COVID-19 related outcomes18,36. Cox models37
are the most common statistical model for time-to-event analysis considering censoring, but multiple ML algorithms
allowing for censoring have been proposed25. Models such as regularized Cox models or Random Survival
Forests have been successfully implemented for EHR38 and COVID-19 data39. These models are based on
underlying assumptions such as proportional hazards in the case of Cox based models37 and handle time as
continuous. An alternative is to consider time as discrete26,37 which has demonstrated performance as good or
better than continuous-time models40,41 with the advantage of accounting for censoring while enabling the
implementation and interpretation of existing ML algorithms such as gradient boosting decision trees42. This
approach allows leveraging structured data such as EHR or data from observational studies to model diverse
outcomes43 in which right-censoring of individuals is observed26. The main limitations are that an extra step of
data processing is needed and the value required for the discretization of the time is arbitrarily chosen. A
decision between interpretable versus more representative periods has to be considered for each specific
application. Nevertheless, by implementing a discrete-time model, we overcame the limitations of Cox based
models, by training ML algorithms that learned complex interactions and non-linear effects from the data.
Because no proportionality of hazards was assumed, our model could predict personalized survival
probabilities27 for each patient given their specific context, further facilitating a precision medicine approach44.
To understand model predictions, ML explainability, or explainable artificial intelligence (xAI), is particularly
powerful to enable scientific insights by leveraging the ability of ML models to learn complexity transcending
traditional assumptions21. In some cases, seemingly paradoxical effects have been unravelled when modelling
clinical data45. Multiple approaches have been proposed to open “black-box” models and allow explainability
by, for example, removing features and measuring their impact on the model46. These methods have been
successfully applied in clinical research for various diseases20,47, but in the case of COVID-19 most of these
applications48 are limited to scenarios of binary classification that ignored censoring. As an alternative approach, we provide
explanations of the model predictions based on SHAP values49 that not only decompose the predicted survival
probability for each patient in terms of the features’ contributions but also reflect temporal dynamics of such
contributions in the context of time-to-event modelling. Local explanations as provided in our study are critical
for precision medicine by indicating patient-specific risk factors, but also raise epistemological challenges on
how to extrapolate from local to global explanations50,51. We employed traditional summary statistics to shed
some light on common risk factors, but such a reduction of complexity may imply a reduction of granularity of
factors that are not relevant at the population level but critical for specific patients. Importantly, the features
selected as good predictors do not necessarily imply causality21,52 and different sets of features have been
demonstrated to be equally predictive in terms of performance in some cases52.
In line with previous studies, we here identified high age15 and sex (male)53 as important risk factors in COVID-
19. As the importance of age increased significantly for patients over 60 years old, while capturing high age as
a risk factor in itself, our model may further reflect other age-related factors such as an increased prevalence of
comorbidities, which was supported by our analysis of the interaction plots. BMI and obesity have previously
been reported13 as risk factors for severe COVID-19 and severe obesity as a risk factor for COVID-related
mortality, especially for younger patients54, who are likely candidates for ICU care and treatment with
mechanical ventilation, resulting in improved survival. In contrast, we identified an increased risk of death for
patients with BMI below 30. This could reflect several other risk factors associated with low BMI, such as
old age and frailty with comorbidities. This is supported by an interaction between BMI and the
number of ordered medicines in early deaths, and interactions with the number of diagnoses, cumulative days
in hospital prior to FPT, and several specific medications for late deaths. Lymphocytopenia was also identified
as a predictor of high mortality in line with previous findings55. This may be a proxy for immune dysfunction,
due to prior or ongoing therapy, malignancy or comorbidity, as well as a severe ongoing COVID-19 disease
itself.
As expected, an increased risk of death was observed in patients with an increased number of medications and
diagnosis codes, likely representing comorbidities, and in line with previous studies56. We found that the
number of ordered medicines was a better predictor of death than the number of diagnoses, emphasizing the
need to capture disease burden based on actual medication in addition to coded diagnoses. This highlights the
need to further explore feature encoding of clinical variables30, to more accurately represent clinical concepts
such as comorbidities. We also observed that hospital encounters for medical examination with known or
unknown causes correlated with a lower risk of death. This may indicate in-patient management of COVID-19
early in the pandemic or reflect increased monitoring of patients with anticipated increased risk of COVID-19,
thereby enabling earlier interventions. Similarly, including the pandemic week in which a patient had their FPT
as a feature revealed that patients tested early rather than later in the pandemic had a higher risk of dying. As our data
covered both the first and second pandemic wave in Denmark, this finding likely reflects that our model captured
improvements in the clinical management of patients throughout the pandemic57.
The implemented discrete-time modelling approach required encoding the week from FPT as a feature,
revealing explanations of temporal dynamics through SHAP values. When interpreting this feature, a higher
risk of death in the first four weeks was observed, probably capturing the risk due to active infection during that
period58. Critically, our model could differentiate between risk factors for early vs late mortality. Here, hospital
admission at the time of FPT, pandemic week of prediction, age and ordering of loop diuretics were important
factors for early death. Meanwhile, factors explaining the risk of late death (>8 weeks) included lower BMI as
a potential proxy for frail patients, diagnosis of Alzheimer’s disease, and ordered B-vitamin (a probable
indicator of patient malnutrition or alcohol abuse). These factors likely represent patient groups who may not
respond well to treatment and are likely not candidates for ICU or mechanical ventilation, thus exhibiting disease
progression leading to late mortality. This is supported by the interactions observed between age, number of
ordered medicines, (low) BMI, and various vitamin supplements, which are factors likely reflecting patient
frailty. Interestingly, age and number of medicines as a proxy for comorbidity burden before SARS-CoV-2
infection remained prominent risk factors throughout the disease course. This suggests that predicting late deaths
requires a different set of risk factors and consideration of their interactions than predicting early death. Thus,
uncovering the interdependency of features important for early vs late death also indicated time dependency of
risk factors.
CONCLUSION
We developed a data-driven machine learning model to identify SARS-CoV-2 positive patients with a high risk
of death within 12 weeks of the first positive test. The discrete-time modelling approach implemented not
only allowed us to train survival models with high performance but also enabled model explainability through
SHAP values. By learning temporal dynamics and interactions between clinical features, the model was able to
identify personalized risk factors and high-risk patients for early interventions while improving the
understanding of the disease. The model is made available for prospective implementation into EHR systems
for real-time decision support. However, a prospective assessment of performance in different health systems
and upon changes in the pandemic will be needed. We demonstrate that leveraging electronic health records
with explainable ML models provides a framework for the implementation of precision medicine in routine care
which can be adapted to other diseases.
METHODS
Data sources
This study was carried out following the relevant guidelines and regulations. Approval from the Danish
Regional Ethical Committee in the Capital Region (H-20026502) and Data Protection Agency (P-2020-426)
was granted, ensuring compliance with the required ethical and legal regulations. Under Danish law, such
approvals grant access to electronic health records (EHR) for research purposes where informed consent from
patients can be waived given that approval from the Ethical Committee (see approval number above) is obtained
before data access. No biological material or samples that are not reported in the EHR were used in this study.
Data were obtained retrospectively from raw EHR from the Capital Region and Region Zealand (eastern
Denmark), covering a population of 2,761,556 people. Data from the electronic patient journal (EPJ) by Epic
Systems are logged and stored in the Chronicles database containing live and historic data. Daily extracts are
transferred into the Clarity and Caboodle databases. The final dataset was extracted from the Caboodle database
containing data up to the 2nd of March 2021. Real-Time Polymerase Chain Reaction (RT-PCR) SARS-CoV-2
test results were used to identify 963,265 individuals over 18 years old with a test taken at test sites reporting to
the EHR system between the 17th of March 2020 and 2nd of March 2021 in eastern Denmark.
Feature engineering
Features were generated according to different data types and retrospective time windows including
observations until the day of the first positive SARS-CoV-2 test (FPT). Basic characteristics such as age, sex,
and body mass index (BMI) were encoded as the latest value observed up to the day of FPT. Measurements
represented as continuous values such as laboratory test results (e.g. lymphocyte levels) and vital parameters
(e.g. systolic blood pressure) were encoded as the latest value observed in the last month before the FPT. For
variables measured as categorical values represented by domain-specific codes, features were generated by
counting the total number of occurrences of each and all codes in defined time windows. For diagnoses
represented by International Statistical Classification of Diseases and Related Health Problems version 10 (ICD-10)
codes, the selected time window was three years, while for medications represented by Anatomical
Therapeutic Chemical (ATC) codes, the time window was one year. Previous hospitalisations, defined as
hospital stays longer than 24h, were encoded as cumulative days in hospital within the last three years as well
as the total count of hospital admissions in this period of time. Features that may help guide the algorithm by
providing a context of external events were also included. Among these features are the number of weeks from
the start of the pandemic until the FPT was taken and a binary feature indicating if the patient was hospitalized
when the FPT occurred. Missingness was assumed to be informative and not at random. For diagnoses and
medications, the lack of a code was taken to mean that the code had not been assigned and was encoded as a zero in the features. For
continuous variables such as laboratory values and vitals, missingness was accounted for by the tree-based ML
algorithm chosen without the need for imputation.
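The two encoding rules described above, the latest value within a retrospective window and code counts within a window, can be sketched in plain Python as follows. This is a minimal illustration and not the study pipeline: the function names, the use of 30 and 365 days to represent one month and one year, and the example records are assumptions.

```python
from datetime import date, timedelta


def latest_value(measurements, fpt, window_days):
    """Latest value observed within `window_days` up to and including the FPT.

    measurements: list of (date, value) pairs.
    Returns None when nothing was measured in the window, leaving the
    missing value to be handled by the downstream tree-based model.
    """
    start = fpt - timedelta(days=window_days)
    in_window = [(d, v) for d, v in measurements if start <= d <= fpt]
    if not in_window:
        return None
    return max(in_window, key=lambda item: item[0])[1]


def code_counts(events, fpt, window_days):
    """Occurrences of each code within the window; absent codes count as zero."""
    start = fpt - timedelta(days=window_days)
    counts = {}
    for d, code in events:
        if start <= d <= fpt:
            counts[code] = counts.get(code, 0) + 1
    return counts


# Hypothetical patient records.
fpt = date(2020, 11, 1)
lymphocytes = [
    (date(2020, 9, 1), 1.8),    # older than one month: ignored
    (date(2020, 10, 20), 0.9),  # latest value before the FPT
    (date(2020, 11, 5), 0.5),   # after the FPT: ignored
]
medications = [
    (date(2019, 6, 1), "A11EA"),    # outside the one-year window
    (date(2020, 3, 1), "C03CA01"),
    (date(2020, 8, 15), "C03CA01"),
    (date(2020, 10, 1), "A11EA"),
]

lymphocyte_feature = latest_value(lymphocytes, fpt, window_days=30)
medication_counts = code_counts(medications, fpt, window_days=365)
total_medication_orders = sum(medication_counts.values())
unseen_code_count = medication_counts.get("N06DA02", 0)  # code never ordered -> 0
```

Measurements dated after the FPT are excluded so that no information from the follow-up period leaks into the features.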
Machine Learning approach to survival modelling
To perform time-to-event modelling we considered a discrete-time modelling approach26 to predict 12-week
mortality since a first SARS-CoV-2 positive test. Described by Cox as an approximation to his proposed
proportional hazards assumption for continuous-time modelling37, discretizing time into intervals allowed us to
perform binary classification at each time interval. By doing this, we trained models that accounted for right-
censored observations, hence reducing the risk of selection bias23, and estimated conditional probabilities of
death given the features that could be computed and explained efficiently without stringent assumptions. Data
was generated from EHR on the 2nd of March 2021, hence right-censoring was observed for patients that had a
positive test from the 8th of December 2020 (12-weeks before data generation) and did not die. The survival
status of these patients could not be ascertained in such a period hence they were only considered for the follow-up
period available. Deaths that occurred on the same day as the first positive test (FPT) were excluded. During
the training phase, the original dataset was augmented longitudinally by repeating each patient’s feature set
containing values up to the FPT into patient-weeks. The feature vector for a patient was repeated according to
the number of weeks since the FPT up to the week of death or censoring for a maximum of 12 weeks since FPT.
The main difference between each row is that time was encoded as an ordinal feature indicating the week of
prediction with values ranging from 1 to 12. The target values for each patient-week were set to 0 up to the
week of death or censoring which were indicated as a 1 or a 0 respectively. When using the trained models for
prediction, the feature set with values up to the FPT for each patient was augmented longitudinally 12 times.
Time was encoded as an ordinal feature with values 1 to 12 so 12 probabilities of death, one probability per
week per patient, would be predicted. The predicted probabilities of death constitute the hazard function $h(t)$,
which can also be expressed as a survival function $S(t)$ and a cumulative distribution function $F(t)$ as defined
below:
$h(t) = P(T = t \mid T \geq t)$    (1)
$S(t) = P(T > t) = \prod_{i=1}^{t} \left(1 - h(i)\right)$    (2)
$F(t) = P(T \leq t) = 1 - S(t)$    (3)
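The patient-week augmentation and the conversion of weekly hazards into survival curves can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, and it assumes the usual discrete-time relations in which the survival probability at week t is the product of (1 - hazard) over weeks 1 to t and the cumulative probability of death is its complement.

```python
import numpy as np

MAX_WEEKS = 12  # prediction horizon in weeks since the first positive test


def expand_for_training(features, event_week, died):
    """One row per week at risk; the target is 1 only in the week of death."""
    last_week = min(event_week, MAX_WEEKS)
    rows, targets = [], []
    for week in range(1, last_week + 1):
        rows.append(list(features) + [week])  # week is appended as an ordinal feature
        targets.append(int(died and week == event_week))
    return np.array(rows, dtype=float), np.array(targets)


def expand_for_prediction(features):
    """Repeat the feature vector once per week, for weeks 1 to 12."""
    return np.array([list(features) + [w] for w in range(1, MAX_WEEKS + 1)], dtype=float)


def hazards_to_curves(hazard):
    """Turn weekly hazards h(t) into survival S(t) and cumulative death F(t)."""
    hazard = np.asarray(hazard, dtype=float)
    survival = np.cumprod(1.0 - hazard)
    cumulative = 1.0 - survival
    return survival, cumulative


# A patient who died in week 3 and a patient censored after week 5.
X_died, y_died = expand_for_training([70.0, 1.0], event_week=3, died=True)
X_cens, y_cens = expand_for_training([45.0, 0.0], event_week=5, died=False)
survival, cumulative = hazards_to_curves([0.1, 0.2])
```

A censored patient contributes only zeros for the weeks actually observed, which is how the discrete-time formulation uses partial follow-up without discarding the patient.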
Local and global explainability
SHAP values22 were calculated to quantify the local contribution of each feature to the risk of death of each
individual at each predicted week. Based on Shapley values originally described in the context of game theory,
SHAP values were computed exactly and efficiently for our tree-based models using TreeSHAP59. The SHAP
values computed in log-odds space for all models trained in the ensemble were averaged and transformed into
probabilities by linear scaling. These probabilities represent the local contribution of each feature to the hazard
h(t) for each predicted week. Similarly, a SHAP interaction matrix was generated using TreeSHAP59 from
which the local contribution of pairs of features to the hazard h(t) could also be calculated49. The SHAP
interaction matrix is a feature-by-feature matrix, where the diagonals show the main contribution of a given
feature, whereas the off-diagonals show the pair-wise interactions for all feature pairs. Local SHAP interaction
values present in the off-diagonals can be understood as the difference between SHAP values for the given pair
of features when one of the features is not present. Effectively, given a pair of features, the change in the SHAP
value of one feature when the other is missing quantifies the interaction strength between the pair. In this
work, we represent the SHAP interaction matrix using a graph, with nodes representing features and edges
representing interaction strengths greater than 0.01. This enables us to assess which features act independently
or jointly when predicting risk of death at different predicted weeks.
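As a rough illustration of how an interaction matrix can be turned into graph edges, the sketch below keeps feature pairs whose off-diagonal interaction value exceeds 0.01 in absolute value. The function name and the toy matrix are hypothetical, the matrix is assumed to be symmetric and already expressed in probability units for a single patient-week, and how values were aggregated across patients before plotting is not specified here.

```python
import numpy as np


def interaction_edges(interactions, feature_names, threshold=0.01):
    """Return (feature_a, feature_b, strength) for off-diagonal entries above threshold."""
    matrix = np.asarray(interactions, dtype=float)
    edges = []
    n_features = matrix.shape[0]
    for i in range(n_features):
        for j in range(i + 1, n_features):  # upper triangle only: matrix assumed symmetric
            if abs(matrix[i, j]) > threshold:
                edges.append((feature_names[i], feature_names[j], float(matrix[i, j])))
    return edges


# Hypothetical interaction matrix for three features of one patient-week.
names = ["age", "bmi", "sex"]
toy_matrix = [
    [0.20, 0.03, 0.00],
    [0.03, 0.05, -0.02],
    [0.00, -0.02, 0.01],
]
edges = interaction_edges(toy_matrix, names)
```

The diagonal entries (main effects) are deliberately skipped, so a feature with no edge is one that acts independently of the others at this threshold.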
While local interpretations are useful to understand patient-specific risk factors, global explanations can reveal
general risk factors by summarizing local explanations. To do so, SHAP values were used to estimate feature
importance. We computed each feature’s importance as the mean of its absolute SHAP values, mean(|SHAP|).
Feature selection was performed by removing features with a mean(|SHAP|) < 0.01. Both local and
global interpretations were provided to clinicians for generating clinical explanations of the risk factors.
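The global importance and the selection rule can be written in a few lines of NumPy. This is a sketch under the assumption that SHAP values are available as a rows-by-features array (one row per patient-week) in probability units; the function names and the toy values are illustrative.

```python
import numpy as np


def global_importance(shap_values):
    """Mean absolute SHAP value per feature across all rows."""
    return np.abs(np.asarray(shap_values, dtype=float)).mean(axis=0)


def select_features(shap_values, feature_names, threshold=0.01):
    """Keep features whose mean(|SHAP|) is not below the threshold."""
    importance = global_importance(shap_values)
    return [name for name, value in zip(feature_names, importance) if value >= threshold]


# Two hypothetical patient-weeks and three features.
toy_shap = [
    [0.10, -0.002, 0.03],
    [-0.20, 0.004, -0.01],
]
importance = global_importance(toy_shap)
kept = select_features(toy_shap, ["age", "noise", "sex"])
```

Taking the absolute value before averaging matters: a feature that raises risk for some patients and lowers it for others would otherwise cancel out and look unimportant.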
Model development and assessment
We trained gradient boosting decision tree models (LightGBM42) using cross-entropy as the objective function
for optimization. To do this, the full dataset was split into training (60%), validation (20%), and test (20%) sets
each one with the same distribution of deaths. Cross-validation (CV) was performed in two steps. First, the
training set was divided into 5 subsets and the subsample rate (0.7), learning rate (0.05), number of iterations
(50) and positive class weight (100) were adjusted using 5-fold cross-validation while the rest of the parameters
were set to default (Supplementary table 3). Once suitable parameters were found, feature selection was
performed based on the validation set. Second, the training set and validation set were combined and split into
5 folds to re-train and generate a final ensemble of 5 models trained on 80% of the data. The performance
reported was assessed by averaging the predictions of the ensemble on the test set (20%), which was not used
for model development.
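The partitioning scheme can be sketched with scikit-learn. This is not the study's code: the helper names, the random seed and the choice to stratify on a per-patient death indicator are assumptions, and the models are represented by plain callables so that the sketch does not depend on a particular LightGBM configuration.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, train_test_split


def split_60_20_20(y, seed=0):
    """Stratified train/validation/test indices in 60/20/20 proportions."""
    y = np.asarray(y)
    idx = np.arange(len(y))
    dev_idx, test_idx = train_test_split(idx, test_size=0.20, stratify=y, random_state=seed)
    # 25% of the remaining 80% gives the 20% validation set.
    train_idx, val_idx = train_test_split(
        dev_idx, test_size=0.25, stratify=y[dev_idx], random_state=seed
    )
    return train_idx, val_idx, test_idx


def ensemble_training_sets(dev_idx, y, n_splits=5, seed=0):
    """Training indices for each of the 5 ensemble members (train + validation pooled)."""
    y = np.asarray(y)
    folds = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    return [dev_idx[fit] for fit, _ in folds.split(dev_idx, y[dev_idx])]


def ensemble_predict(models, X):
    """Average the predictions of all ensemble members."""
    return np.mean([model(X) for model in models], axis=0)


# Hypothetical cohort of 1000 patients with 50 deaths.
y = np.array([1] * 50 + [0] * 950)
train_idx, val_idx, test_idx = split_60_20_20(y)
dev_idx = np.concatenate([train_idx, val_idx])
training_sets = ensemble_training_sets(dev_idx, y)
```

Splitting at the patient level before expanding into patient-weeks keeps all rows of one patient in the same partition, which avoids leakage between training and test data.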
Based on the predicted cumulative probabilities of death, time-to-event performance was measured by the
concordance index (C-index) based on the inverse probability of censoring weights60 across all weeks.
Performance was further assessed at each week by excluding right-censored cases when calculating binary
metrics and measured in terms of precision-recall area under the curve (PR-AUC), Matthews correlation
coefficient (MCC)61, sensitivity and specificity. A threshold of 0.5 was used to turn predicted probabilities into
binary classes. Confidence intervals (95% CI) for the performance metrics were calculated by bootstrapping
with resampling for 1000 iterations.
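A bootstrapped confidence interval for a thresholded metric such as the MCC can be sketched as below. The percentile method, the random seed, the `>=` comparison at the 0.5 threshold and the function names are assumptions of this example rather than details reported in the text.

```python
import numpy as np


def mcc(y_true, y_pred):
    """Matthews correlation coefficient from binary labels and predictions."""
    tp = float(np.sum((y_true == 1) & (y_pred == 1)))
    tn = float(np.sum((y_true == 0) & (y_pred == 0)))
    fp = float(np.sum((y_true == 0) & (y_pred == 1)))
    fn = float(np.sum((y_true == 1) & (y_pred == 0)))
    denominator = np.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denominator if denominator > 0 else 0.0


def bootstrap_ci(y_true, probabilities, metric, n_boot=1000, threshold=0.5, seed=0):
    """Point estimate and 95% percentile interval from resampling with replacement."""
    rng = np.random.default_rng(seed)
    y_true = np.asarray(y_true)
    y_pred = (np.asarray(probabilities) >= threshold).astype(int)
    n = len(y_true)
    resampled = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)  # sample patients with replacement
        resampled.append(metric(y_true[idx], y_pred[idx]))
    lower, upper = np.percentile(resampled, [2.5, 97.5])
    return metric(y_true, y_pred), lower, upper


# Hypothetical labels and predicted probabilities for eight non-censored patients.
labels = [1, 1, 0, 0, 1, 0, 0, 0]
probs = [0.9, 0.8, 0.2, 0.6, 0.4, 0.1, 0.3, 0.2]
point, lower, upper = bootstrap_ci(labels, probs, mcc)
```

With so few patients the interval is very wide; the example is only meant to show the mechanics of the resampling.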
Software
Data wrangling was performed using R62 and the tidyverse library63. Feature engineering was performed in
Python using the pandas64 and numpy65 libraries. Gradient boosting decision trees were trained using
LightGBM42, and model performance was assessed using the implementations in scikit-learn66 and
scikit-survival67. Summary statistics were generated using tableone68.
DATA AND CODE AVAILABILITY
Data can be requested through the corresponding author. Due to data protection regulations, the data
cannot be made publicly available, but the authors will assist external researchers in accessing the data on a
collaborative basis upon request. The trained models and code to run predictions are publicly available on
GitHub under a GNU Affero General Public License v3.0 (https://github.com/PERSIMUNE/COVIMUN_DT).
ACKNOWLEDGEMENTS
The study was supported by a COVID-19 grant from the Ministry of Higher Education and Science (0238-
00006B) and the Danish National Research Foundation (DNRF126). The Capital Region of Denmark, Center
for Economy, provided data extracts from the EHR system.
CONTRIBUTIONS
A.G.Z, R.A, S.R.O and C.U.N conceived the project and supervised it. A.G.Z, R.A and K.S.M. performed data
cleaning. A.G.Z and R.A developed the model and visualizations. A.G.Z, R.A, S.R.O, C.U.N. designed the
study. A.G.Z, R.A, R.S, K.S.M., R.Z.M, S.R.O, C.U.N. interpreted the data and results. A.G.Z, R.A, R.S,
R.Z.M, C.U.N. wrote the paper. All authors commented on and approved the final manuscript.
CONFLICTS OF INTEREST
C.U.N. received research funding and/or consultancy fees outside this work from Abbvie, Janssen, AstraZeneca,
Roche, CSL Behring, Takeda and Octapharma. All other authors have no conflicts of interest.
REFERENCES
1. Coronavirus Disease (COVID-19): Weekly Epidemiological Update (20 April 2022) - World. ReliefWeb
https://reliefweb.int/report/world/coronavirus-disease-covid-19-weekly-epidemiological-update-20-april-
2022.
2. Yang, R., Gui, X. & Xiong, Y. Comparison of Clinical Characteristics of Patients with Asymptomatic vs
Symptomatic Coronavirus Disease 2019 in Wuhan, China. JAMA network open 3, e2010182 (2020).
3. Wu, Z. & McGoogan, J. M. Characteristics of and Important Lessons from the Coronavirus Disease 2019
(COVID-19) Outbreak in China: Summary of a Report of 72314 Cases from the Chinese Center for Disease
Control and Prevention. JAMA - Journal of the American Medical Association 323, 1239–1242 (2020).
4. Guan, W. et al. Clinical Characteristics of Coronavirus Disease 2019 in China. New England Journal of
Medicine 382, 1708–1720 (2020).
5. Huang, C. et al. Clinical features of patients infected with 2019 novel coronavirus in Wuhan, China. The
Lancet 395, 497–506 (2020).
6. Chen, G. et al. Clinical and immunological features of severe and moderate coronavirus disease 2019.
Journal of Clinical Investigation 130, 2620–2629 (2020).
7. Zhou, F. et al. Clinical course and risk factors for mortality of adult inpatients with COVID-19 in Wuhan,
China: a retrospective cohort study. The Lancet 395, 1054–1062 (2020).
8. Grasselli, G. et al. Baseline Characteristics and Outcomes of 1591 Patients Infected with SARS-CoV-2
Admitted to ICUs of the Lombardy Region, Italy. JAMA - Journal of the American Medical Association
323, 1574–1581 (2020).
9. Myers, L. C., Parodi, S. M., Escobar, G. J. & Liu, V. X. Characteristics of Hospitalized Adults With
COVID-19 in an Integrated Health Care System in California. JAMA - Journal of the American Medical
Association 323, 2195–2197 (2020).
10. Docherty, A. B. et al. Features of 20 133 UK patients in hospital with covid-19 using the ISARIC WHO
Clinical Characterisation Protocol: Prospective observational cohort study. The BMJ 369, 1–12 (2020).
11. Brosh-Nissimov, T. et al. BNT162b2 vaccine breakthrough: clinical characteristics of 152 fully vaccinated
hospitalized COVID-19 patients in Israel. Clinical Microbiology and Infection 0, (2021).
12. Reddy, R. K. et al. The effect of smoking on COVID-19 severity: A systematic review and meta-analysis.
(2020) doi:10.1002/jmv.26389.
13. Gao, F. et al. Obesity Is a Risk Factor for Greater COVID-19 Severity. Diabetes Care (2020)
doi:10.2337/dc20-0682.
14. Yang, L. et al. Effects of cancer on patients with COVID-19: a systematic review and meta-analysis of
63,019 participants. Cancer Biology & Medicine 18, 298–307 (2021).
15. Gao, Y. dong et al. Risk factors for severe and critically ill COVID-19 patients: A review. Allergy:
European Journal of Allergy and Clinical Immunology 76, 428–455 (2021).
16. Wu, C. et al. Risk Factors Associated with Acute Respiratory Distress Syndrome and Death in Patients with
Coronavirus Disease 2019 Pneumonia in Wuhan, China. JAMA Internal Medicine 180, 934–943 (2020).
17. Izcovich, A. et al. Prognostic factors for severity and mortality in patients infected with COVID-19: A
systematic review. PloS one 15, e0241955 (2020).
18. Wynants, L. et al. Prediction models for diagnosis and prognosis of covid-19 infection: systematic review
and critical appraisal. bmj 369, (2020).
19. Yanai, I. & Lercher, M. A hypothesis is a liability. Genome Biology 21, 231 (2020).
20. Agius, R. et al. Machine learning can identify newly diagnosed patients with CLL at high risk of infection.
Nature Communications 11, 1–17 (2020).
21. Roscher, R., Bohn, B., Duarte, M. F. & Garcke, J. Explainable Machine Learning for Scientific Insights
and Discoveries. arXiv:1905.08883 [cs, stat] (2019).
22. Lundberg, S. & Lee, S.-I. A Unified Approach to Interpreting Model Predictions. arXiv:1705.07874 [cs,
stat] (2017).
23. Li, Y., Sperrin, M., Ashcroft, D. M. & Staa, T. P. van. Consistency of variety of machine learning and
statistical models in predicting clinical risks of individual patients: longitudinal cohort study using
cardiovascular disease as exemplar. BMJ 371, (2020).
24. Vock, D. M. et al. Adapting machine learning techniques to censored time-to-event health record data: A
general-purpose approach using inverse probability of censoring weighting. Journal of Biomedical
Informatics 61, 119–131 (2016).
25. Wang, P., Li, Y. & Reddy, C. K. Machine Learning for Survival Analysis: A Survey. arXiv:1708.04649
[cs, stat] (2017).
26. Tutz, G. & Schmid, M. Modeling Discrete Time-to-Event Data. (Springer International Publishing, 2016).
doi:10.1007/978-3-319-28158-2.
27. Haider, H., Hoehn, B., Davis, S. & Greiner, R. Effective Ways to Build and Evaluate Individual Survival
Distributions. Journal of Machine Learning Research 21, 1–63 (2020).
28. Reilev, M. et al. Characteristics and predictors of hospitalization and death in the first 11 122 cases with a
positive RT-PCR test for SARS-CoV-2 in Denmark: a nationwide cohort. International Journal of
Epidemiology 49, 1468–1481 (2020).
29. Roberts, M. et al. Common pitfalls and recommendations for using machine learning to detect and
prognosticate for COVID-19 using chest radiographs and CT scans. Nat Mach Intell 3, 199–217 (2021).
30. Landi, I. et al. Deep representation learning of electronic health records to unlock patient stratification at
scale. npj Digit. Med. 3, 1–11 (2020).
31. Li, Y. et al. BEHRT: Transformer for Electronic Health Records. Sci Rep 10, 7155 (2020).
32. Fu, J., Ye, J. & Cui, W. The Dice measure of cubic hesitant fuzzy sets and its initial evaluation method of
benign prostatic hyperplasia symptoms. Sci Rep 9, 60 (2019).
33. Cui, W.-H. & Ye, J. Logarithmic similarity measure of dynamic neutrosophic cubic sets and its application
in medical diagnosis. Computers in Industry 111, 198–206 (2019).
34. Ramagopalan, S. V., Simpson, A. & Sammon, C. Can real-world data really replace randomised clinical
trials? BMC Medicine 18, 13 (2020).
35. Jensen, P. B., Jensen, L. J. & Brunak, S. Mining electronic health records: towards better research
applications and clinical care. Nature Reviews Genetics 13, 395–405 (2012).
36. Jimenez-Solem, E. et al. Developing and validating COVID-19 adverse outcome risk prediction models
from a bi-national European cohort of 5594 patients. Scientific Reports 11, 3246 (2021).
37. Cox, D. R. Regression Models and Life-Tables. Journal of the Royal Statistical Society. Series B
(Methodological) 34, 187–220 (1972).
38. Steele, A. J., Denaxas, S. C., Shah, A. D., Hemingway, H. & Luscombe, N. M. Machine learning models
in electronic health records can outperform conventional survival models for predicting patient mortality in
coronary artery disease. PLOS ONE 13, e0202344 (2018).
39. Liang, W. et al. Early triage of critically ill COVID-19 patients using deep learning. Nature
Communications 11, 3543 (2020).
40. Kvamme, H. & Borgan, Ø. Continuous and Discrete-Time Survival Prediction with Neural Networks.
arXiv:1910.06724 [cs, stat] (2019).
41. Sloma, M., Syed, F., Nemati, M. & Xu, K. S. Empirical Comparison of Continuous and Discrete-time
Representations for Survival Prediction. in Proceedings of AAAI Spring Symposium on Survival Prediction
- Algorithms, Challenges, and Applications 2021 118–131 (PMLR, 2021).
42. Ke, G. et al. LightGBM: A Highly Efficient Gradient Boosting Decision Tree. Advances in Neural
Information Processing Systems 30, (2017).
43. Syrowatka, A. et al. Leveraging artificial intelligence for pandemic preparedness and response: a scoping
review to identify key use cases. npj Digit. Med. 4, 1–14 (2021).
44. Haendel, M. A., Chute, C. G. & Robinson, P. N. Classification, Ontology, and Precision Medicine. N Engl
J Med 379, 1452–1462 (2018).
45. Caruana, R. et al. Intelligible Models for HealthCare: Predicting Pneumonia Risk and Hospital 30-day
Readmission. in Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery
and Data Mining 1721–1730 (ACM, 2015). doi:10.1145/2783258.2788613.
46. Covert, I., Lundberg, S. & Lee, S.-I. Explaining by Removing: A Unified Framework for Model
Explanation. arXiv:2011.14878 [cs, stat] (2020).
47. Lauritsen, S. M. et al. Explainable artificial intelligence model to predict acute critical illness from
electronic health records. Nature Communications 11, 3852 (2020).
48. Yan, L. et al. An interpretable mortality prediction model for COVID-19 patients. Nature Machine
Intelligence 1–6 (2020) doi:10.1038/s42256-020-0180-7.
49. Lundberg, S. M. et al. From local explanations to global understanding with explainable AI for trees. Nat
Mach Intell 2, 56–67 (2020).
50. Murdoch, W. J., Singh, C., Kumbier, K., Abbasi-Asl, R. & Yu, B. Interpretable machine learning:
definitions, methods, and applications. arXiv:1901.04592 [cs, stat] (2019).
51. Miller, T. Explanation in artificial intelligence: Insights from the social sciences. Artificial Intelligence 267,
1–38 (2019).
52. Molnar, C. et al. General Pitfalls of Model-Agnostic Interpretation Methods for Machine Learning Models.
arXiv:2007.04131 [cs, stat] (2021).
53. Zhang, J. et al. Risk factors for disease severity, unimprovement, and mortality in COVID-19 patients in
Wuhan, China. Clin Microbiol Infect 26, 767–772 (2020).
54. Klang, E. et al. Severe Obesity as an Independent Risk Factor for COVID-19 Mortality in Hospitalized
Patients Younger than 50. Obesity (Silver Spring) 28, 1595–1599 (2020).
55. Cippà, P. E. et al. A data-driven approach to identify risk profiles and protective drugs in COVID-19. PNAS
118, (2021).
56. Guan, W. et al. Comorbidity and its impact on 1590 patients with COVID-19 in China: a nationwide
analysis. European Respiratory Journal 55, (2020).
57. Benfield, T. et al. Improved Survival Among Hospitalized Patients With Coronavirus Disease 2019
(COVID-19) Treated With Remdesivir and Dexamethasone. A Nationwide Population-Based Cohort
Study. Clinical Infectious Diseases (2021) doi:10.1093/cid/ciab536.
58. Matheson, N. J. & Lehner, P. J. How does SARS-CoV-2 cause COVID-19? Science 369, 510–511 (2020).
59. Lundberg, S. M., Erion, G. G. & Lee, S.-I. Consistent Individualized Feature Attribution for Tree
Ensembles. arXiv:1802.03888 [cs, stat] (2018).
60. Uno, H., Cai, T., Pencina, M. J., D’Agostino, R. B. & Wei, L. J. On the C-statistics for evaluating overall
adequacy of risk prediction procedures with censored survival data. Stat Med 30, 1105–1117 (2011).
61. Chicco, D. & Jurman, G. The advantages of the Matthews correlation coefficient (MCC) over F1 score and
accuracy in binary classification evaluation. BMC Genomics 21, 6 (2020).
62. R Core Team. R: A Language and Environment for Statistical Computing. (R Foundation for Statistical
Computing, 2019).
63. Wickham, H. et al. Welcome to the tidyverse. Journal of Open Source Software 4, 1686 (2019).
64. The pandas development team. pandas-dev/pandas: Pandas 1.3.3. (Zenodo, 2021).
doi:10.5281/zenodo.5501881.
65. Harris, C. R. et al. Array programming with NumPy. Nature 585, 357–362 (2020).
66. Pedregosa, F. et al. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research 12,
2825–2830 (2011).
67. Pölsterl, S. scikit-survival: A Library for Time-to-Event Analysis Built on Top of scikit-learn. Journal of
Machine Learning Research 21, 1–6 (2020).
68. Pollard, T. J., Johnson, A. E. W., Raffa, J. D. & Mark, R. G. tableone: An open source Python package for
producing summary statistics for research papers. JAMIA Open 1, 26–31 (2018).
FIGURES
Figure 1. Overview of the data sources, feature engineering and modelling approach for
predicting 12-week mortality in SARS-CoV-2 positive patients.
a, Electronic Health Records (EHR) of 33,938 patients from 17th of March 2020 to 2nd of March 2021 (incidence
curve) in eastern Denmark (geographical region visualized in red) were used to predict 12-week mortality from
the first positive SARS-CoV-2 test (FPT). b, Features were engineered as the last value observed prior to FPT
within the last month for vitals and laboratory values. To encode hospital admissions, medications and
diagnoses, the count of occurrences within three or one year(s) prior to FPT was used. c, Machine learning
algorithms were trained for survival modelling using a discrete-time approach. Time-to-event data were
transformed longitudinally into patient-weeks up to the loss of follow-up (0) or death (1). With the augmented
data, binary classification was performed by gradient boosting decision trees to predict personalized survival
distributions for each patient and provide explanations of individual risk factors using SHAP values.
| Variable | Level | Overall | Censored | Died | Survived |
|---|---|---|---|---|---|
| n | | 33938 | 14907 | 1662 | 17369 |
| Age, median [Q1,Q3] | | 49.0 [33.0,64.0] | 50.0 [34.0,66.0] | 83.0 [75.0,89.0] | 45.0 [31.0,59.0] |
| Sex, n (%) | Female | 19581 (57.7) | 8800 (59.0) | 787 (47.4) | 9994 (57.5) |
| Number of ordered medicines, median [Q1,Q3] | | 0.0 [0.0,4.0] | 0.0 [0.0,4.0] | 16.0 [4.0,27.0] | 0.0 [0.0,2.0] |
| Number of ordered medicines, n (%) | 0 | 20207 (59.5) | 8854 (59.4) | 168 (10.1) | 11185 (64.4) |
| | >= 1 | 13731 (40.5) | 6053 (40.6) | 1494 (89.9) | 6184 (35.6) |
| Number of diagnoses, median [Q1,Q3] | | 4.0 [2.0,8.0] | 4.0 [2.0,8.0] | 11.0 [6.0,16.0] | 4.0 [2.0,7.0] |
| Number of diagnoses, n (%) | 0 | 5596 (16.5) | 2277 (15.3) | 54 (3.2) | 3265 (18.8) |
| | >= 1 | 28342 (83.5) | 12630 (84.7) | 1608 (96.8) | 14104 (81.2) |
| Admitted at the time of first positive test, n (%) | | 2485 (7.3) | 927 (6.2) | 534 (32.1) | 1024 (5.9) |
| Previous admissions in the last 3 years, median [Q1,Q3] | | 0.0 [0.0,1.0] | 0.0 [0.0,1.0] | 1.0 [0.0,3.0] | 0.0 [0.0,0.0] |
| Cumulative days in hospital within the last 3 years, median [Q1,Q3] | | 0.0 [0.0,1.0] | 0.0 [0.0,1.0] | 7.0 [0.0,19.0] | 0.0 [0.0,0.0] |
| Pandemic week, median [Q1,Q3] | | 39.0 [19.0,42.0] | 42.0 [41.0,44.0] | 40.0 [7.0,43.0] | 26.0 [6.0,35.0] |
| Body Mass Index, median [Q1,Q3] | | 25.7 [22.6,29.7] | 25.6 [22.5,29.8] | 24.3 [21.3,28.0] | 26.0 [23.0,29.9] |
| Absolute Lymphocyte count (LYM), laboratory, last value, median [Q1,Q3] | | 1.1 [0.7,1.6] | 1.1 [0.8,1.7] | 0.9 [0.6,1.3] | 1.2 [0.8,1.6] |
| Laxatives (A06AD), Ordered Medicine, count, n (%) | 0 | 31131 (91.7) | 13639 (91.5) | 919 (55.3) | 16573 (95.4) |
| | >= 1 | 2807 (8.3) | 1268 (8.5) | 743 (44.7) | 796 (4.6) |
| Paracetamol (N02BE), Ordered Medicine, count, n (%) | 0 | 26453 (77.9) | 11430 (76.7) | 527 (31.7) | 14496 (83.5) |
| | >= 1 | 7485 (22.1) | 3477 (23.3) | 1135 (68.3) | 2873 (16.5) |
| Loop diuretics (C03CA), Ordered Medicine, count, n (%) | 0 | 31403 (92.5) | 13753 (92.3) | 965 (58.1) | 16685 (96.1) |
| | >= 1 | 2535 (7.5) | 1154 (7.7) | 697 (41.9) | 684 (3.9) |
| Opioid anesthetics (N01AH), Ordered Medicine, count, n (%) | 0 | 30920 (91.1) | 13470 (90.4) | 1314 (79.1) | 16136 (92.9) |
| | >= 1 | 3018 (8.9) | 1437 (9.6) | 348 (20.9) | 1233 (7.1) |
| Vitamin B-complex (A11EA), Ordered Medicine, count, n (%) | 0 | 33216 (97.9) | 14515 (97.4) | 1497 (90.1) | 17204 (99.1) |
| | >= 1 | 722 (2.1) | 392 (2.6) | 165 (9.9) | 165 (0.9) |
| Alzheimer's disease (G30), Diagnose, count, n (%) | 0 | 33432 (98.5) | 14655 (98.3) | 1529 (92.0) | 17248 (99.3) |
| | >= 1 | 506 (1.5) | 252 (1.7) | 133 (8.0) | 121 (0.7) |
| Encounter for medical observation (Z03), Diagnose, count, n (%) | 0 | 24170 (71.2) | 10568 (70.9) | 788 (47.4) | 12814 (73.8) |
| | >= 1 | 9768 (28.8) | 4339 (29.1) | 874 (52.6) | 4555 (26.2) |
| Encounter for other special examination (Z01), Diagnose, count, n (%) | 0 | 18284 (53.9) | 7485 (50.2) | 637 (38.3) | 10162 (58.5) |
| | >= 1 | 15654 (46.1) | 7422 (49.8) | 1025 (61.7) | 7207 (41.5) |
| Essential hypertension (I10), Diagnose, count, n (%) | 0 | 31501 (92.8) | 13783 (92.5) | 1256 (75.6) | 16462 (94.8) |
| | >= 1 | 2437 (7.2) | 1124 (7.5) | 406 (24.4) | 907 (5.2) |
| Loop diuretics (C03CA), Administered medicine, count, n (%) | 0 | 32384 (95.4) | 14135 (94.8) | 1177 (70.8) | 17072 (98.3) |
| | >= 1 | 1554 (4.6) | 772 (5.2) | 485 (29.2) | 297 (1.7) |
| Calcium + vitamin D (A12AX), Administered medicine, count, n (%) | 0 | 32530 (95.9) | 14200 (95.3) | 1267 (76.2) | 17063 (98.2) |
| | >= 1 | 1408 (4.1) | 707 (4.7) | 395 (23.8) | 306 (1.8) |
Table 1. Summary statistics of the cohort based on the final feature set.
Values up to the day of the first positive SARS-CoV-2 test used for training and prediction were considered.
Continuous variables were summarized by the median and interquartile ranges (Q1, Q3). Diagnoses and
medicines with their ICD-10 and ATC codes in parentheses respectively were summarized as the number of
patients with at least one code assigned. Only body mass index and absolute lymphocyte counts reported missing
values for 17,823 and 32,803 patients respectively. Patients that had a positive test from the 8th of December
2020 (12-weeks before data generation) and did not die before the 2nd of March 2021 were censored.
Figure 2. Binary performance metrics for 12-week mortality prediction.
Precision-recall area under the curve (PR-AUC) and Matthews correlation coefficient (MCC) were calculated
for each predicted week only considering non-censored patients in the test set. The lower panel of each plot
depicts the mean values of PR-AUC and MCC at each week based on all patients (a), patients not admitted to
the hospital at the time of first positive test (b) and patients who were admitted at the time of first positive test
(c). The upper panels of each subfigure contain bar plots showing the number of patients who died (red) during
the given week while patients censored due to lack of follow-up (grey) were omitted for the performance
metrics.
Figure 3. Predicted individual discrete and cumulative death probabilities.
Weekly discrete and cumulative probabilities of death were predicted for all patients in the test set using data
prior to their first positive test. Individual probabilities were summarized by the median, 80th and 20th percentiles
for patients who died (red) or survived (green) (a). Predicted cumulative death probabilities were summarized
by the median (b) for patients who died before 4 weeks (pink), between 4 and 8 weeks (yellow) and after 8
weeks (blue). Individual examples of predicted cumulative (c) and discrete (d) death probabilities for three
patients are depicted indicating the time of death (black dot) or censoring (x).
Figure 4. Global and local explanations of feature contributions to the risk of death in SARS-
CoV-2 positive patients.
SHAP values for each patient-week in the test set were calculated to explain the contribution of features to the
discrete probability of death. A beeswarm plot (a) was generated to agglomerate all individual SHAP values for
each patient-week with features coloured according to their normalised feature values. To explore the temporal
dynamics, heatmaps were generated to show the maximum feature importance represented as the max(|SHAP|)
across all patients (b) for each predicted week. The total feature importance of each feature was calculated as
the mean(|SHAP|) across all weeks and shown as a bar plot (b). To exemplify personalized explanations, SHAP
values for two patients (c-d) were depicted as heatmaps with their corresponding predicted discrete probabilities
of death on top. The original feature values for each patient were reported inside round brackets next to the
feature names. In all heatmaps, features were ordered by hierarchical clustering of the original feature values
using Pearson correlation as the distance metric and average linkage.
Figure 5. Individual feature explanations by survival status.
Partial dependence plots (PDP) of SHAP values versus age (a), body mass index (b), sex (c), lymphocyte
levels (d), cumulative days in hospital (e) and the number of admissions (f) in the last 3 years, admission status
at the time of first positive test (g) and the number of ordered medicines (h). Each dot shows a patient-week
value coloured by survival status indicating those patients who survived (green) or died (red). Total SHAP
values are represented as explained contributions in terms of probability (y-axis) given all the feature values
for a patient whereas features (x-axis) are represented by their corresponding value. The top and left panels of
each PDP plot depict letter-value plots of the distribution of the x and y axes by survival status. Top panels were
substituted by bar plots for categorical variables. Additional PDPs for the remaining features can be found in
Supplementary Fig 2-4.
Figure 6. Summary of relevant feature interactions in explaining early and late mortality in
SARS-CoV-2 positive patients.
For each patient that died within 12 weeks, the SHAP interaction values between all 22 features were calculated.
Only interaction values with an absolute value greater than 0.01 were considered relevant and counted. Counts
were averaged across all patients to show the percentage of patients for which a given pair of features was
relevant. The diagonal represents the percentage of patients for which each feature had a SHAP value higher
than 0.01. Relevant feature interactions are shown for patients who died within 4 weeks (a) and for those who
died between 8 and 12 weeks (b), thus visualizing the difference in feature interactions for early and late mortality in SARS-CoV-2 positive
patients. In both heatmaps, features were ordered by hierarchical clustering using Euclidean distance as the
metric for average linkage.
Supplementary information
Personalized survival probabilities for SARS-CoV-2 positive patients by
explainable machine learning
Adrian G. Zucco1, Rudi Agius2, Rebecka Svanberg2, Kasper S. Moestrup1, Ramtin Z. Marandi1, Cameron Ross
MacPherson1, Jens Lundgren1,4, Sisse R. Ostrowski3,4*, Carsten U. Niemann2,4*
1PERSIMUNE Center of Excellence, Rigshospitalet, Copenhagen, Denmark.
2Department of Hematology, Rigshospitalet, Copenhagen, Denmark.
3Department of Clinical Immunology, Rigshospitalet, Copenhagen, Denmark.
4Department of Clinical Medicine, University of Copenhagen, Denmark.
*Co-senior authors.
Correspondence should be addressed to: A.G.Z (adrian.gabriel.zucco@regionh.dk), S.R.O
(Sisse.Rye.Ostrowski@regionh.dk) or C.U.N (Carsten.Utoft.Niemann@regionh.dk).
Supplementary figure 1. Consort diagram of the cohort.
963,265 individuals were identified using Real-Time Polymerase Chain Reaction (RT-PCR) SARS-CoV-2 test results
taken between the 17th of March 2020 and 2nd of March 2021 in eastern Denmark. All reported numbers correspond to
positive tests, admissions, discharges and deaths that occurred in the mentioned period, regardless of whether such
events occurred more than 12 weeks after a first positive test.
| Group | TP | FN | FP | TN | Precision | Sensitivity | Specificity | ROC-AUC | Censored | Deaths |
|---|---|---|---|---|---|---|---|---|---|---|
| All patients | 552 | 4 | 851 | 5408 | 0.3934 | 0.9928 | 0.864 | 0.9703 | 5359 | 556 |
| Patients tested outside the hospital | 371 | 4 | 592 | 5292 | 0.3853 | 0.9893 | 0.8994 | 0.9774 | 5022 | 375 |
| Patients admitted to the hospital at the time of test | 181 | 0 | 259 | 116 | 0.4114 | 1 | 0.3093 | 0.8576 | 337 | 181 |
Supplementary Table 1. Binary performance metrics by admission status.
Binary metrics were assessed 12 weeks from the first positive test. True positives (TP), false negatives (FN), false
positives (FP), true negatives (TN), area under the receiver operating characteristic curve (ROC-AUC).
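As a quick consistency check, the precision, sensitivity and specificity reported in the table can be recomputed from the confusion-matrix counts using their standard definitions. The function name is illustrative; the counts are those reported for the "All patients" and "admitted at the time of test" groups.

```python
def binary_metrics(tp, fn, fp, tn):
    """Precision, sensitivity and specificity from confusion-matrix counts."""
    return {
        "precision": tp / (tp + fp),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


all_patients = binary_metrics(tp=552, fn=4, fp=851, tn=5408)
admitted = binary_metrics(tp=181, fn=0, fp=259, tn=116)
```

The contrast between the two groups is visible directly in the counts: among admitted patients no deaths were missed, but most survivors were flagged as positive, which is what drives the low specificity.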
Supplementary Figure 2. Partial dependence plots of diagnoses by survival status.
Supplementary Figure 3. Partial dependence plots of medications by survival status.
Supplementary Figure 4. Partial dependence plots of temporal features by survival status.
Diagnoses represented as ICD-10 codes
A01, A02, A04, A05, A06, A07, A08, A09, A15, A16, A17, A18, A23, A24, A26, A28, A35, A37, A38, A40, A41, A44, A46,
A48, A49, A51, A52, A53, A54, A56, A60, A63, A64, A68, A69, A70, A74, A77, A79, A80, A81, A86, A87, A88, A89, B00,
B01, B02, B07, B08, B09, B15, B16, B17, B18, B20, B22, B23, B25, B26, B27, B30, B33, B34, B35, B36, B37, B44, B48,
B49, B50, B51, B54, B55, B58, B65, B66, B67, B71, B76, B80, B83, B86, B90, B91, B94, B95, B96, B97, B98, B99, C01,
C02, C04, C05, C06, C07, C09, C10, C11, C13, C15, C16, C17, C18, C20, C21, C22, C23, C24, C25, C26, C30, C32, C34,
C37, C38, C40, C43, C44, C45, C46, C48, C49, C50, C51, C52, C53, C54, C56, C57, C60, C61, C62, C64, C65, C66, C67,
C69, C71, C73, C74, C76, C77, C78, C79, C80, C81, C82, C83, C84, C85, C86, C88, C90, C91, C92, C93, C94, C95, C96,
C99, D03, D04, D05, D06, D07, D09, D10, D11, D12, D13, D14, D15, D16, D17, D18, D19, D20, D21, D22, D23, D24, D25,
D27, D28, D29, D30, D31, D32, D33, D34, D35, D36, D37, D38, D39, D40, D41, D43, D44, D45, D46, D47, D48, D50, D51,
D52, D55, D56, D58, D59, D60, D61, D62, D63, D64, D65, D66, D67, D68, D69, D70, D71, D72, D73, D74, D75, D76, D80,
D82, D83, D84, D86, D89, E03, E04, E05, E06, E07, E10, E11, E13, E14, E15, E16, E20, E21, E22, E23, E24, E25, E26, E27,
E28, E29, E30, E31, E34, E41, E46, E47, E50, E51, E53, E55, E56, E58, E60, E61, E63, E64, E65, E66, E67, E68, E70, E72,
E73, E74, E75, E78, E79, E80, E83, E84, E85, E86, E87, E88, E89, F00, F01, F02, F03, F04, F05, F06, F07, F09, F10, F11,
F12, F13, F14, F15, F16, F17, F19, F20, F21, F22, F23, F25, F28, F29, F30, F31, F32, F33, F34, F38, F39, F40, F41, F42, F43,
F44, F45, F48, F50, F51, F52, F53, F55, F59, F60, F61, F62, F63, F64, F65, F68, F70, F71, F72, F73, F78, F79, F80, F81, F82,
F84, F88, F89, F90, F91, F92, F93, F94, F95, F98, F99, G00, G02, G03, G04, G05, G06, G10, G11, G12, G14, G20, G21,
G22, G23, G24, G25, G30, G31, G35, G36, G37, G40, G41, G43, G44, G45, G46, G47, G50, G51, G52, G53, G54, G55, G56,
G57, G58, G59, G60, G61, G62, G63, G64, G70, G71, G72, G73, G80, G81, G82, G83, G90, G91, G92, G93, G94, G95, G96,
G97, G98, G99, H00, H01, H02, H03, H04, H05, H06, H10, H11, H13, H15, H16, H17, H18, H19, H20, H21, H22, H25, H26,
H27, H28, H30, H31, H33, H34, H35, H36, H40, H42, H43, H44, H45, H46, H47, H48, H49, H50, H51, H52, H53, H54, H55,
H57, H58, H60, H61, H62, H65, H66, H68, H69, H70, H71, H72, H73, H74, H80, H81, H82, H83, H90, H91, H92, H93, H94,
H95, I05, I06, I07, I10, I11, I12, I13, I15, I20, I21, I23, I24, I25, I26, I27, I30, I31, I32, I33, I34, I35, I36, I37, I38, I39, I40,
I42, I44, I45, I46, I47, I48, I49, I50, I51, I52, I60, I61, I62, I63, I64, I65, I66, I67, I68, I69, I70, I71, I72, I73, I74, I77, I78, I79,
I80, I81, I82, I83, I85, I86, I87, I88, I89, I95, I97, I99, J00, J01, J02, J03, J04, J05, J06, J09, J10, J11, J12, J13, J14, J15, J16,
J17, J18, J20, J21, J22, J30, J31, J32, J33, J34, J35, J36, J37, J38, J39, J40, J41, J42, J43, J44, J45, J46, J47, J61, J62, J64, J67,
J68, J69, J70, J80, J81, J82, J84, J85, J86, J90, J91, J92, J93, J94, J95, J96, J98, K00, K01, K02, K03, K04, K05, K06, K07,
K08, K09, K10, K11, K12, K13, K14, K20, K21, K22, K25, K26, K27, K28, K29, K30, K31, K35, K36, K37, K38, K40, K41,
K42, K43, K44, K45, K46, K50, K51, K52, K55, K56, K57, K58, K59, K60, K61, K62, K63, K64, K65, K66, K70, K71, K72,
K73, K74, K75, K76, K80, K81, K82, K83, K85, K86, K87, K90, K91, K92, L00, L01, L02, L03, L04, L05, L08, L10, L11,
L12, L13, L20, L21, L22, L23, L24, L25, L26, L27, L28, L29, L30, L40, L41, L42, L43, L44, L50, L51, L52, L53, L55, L56,
L57, L58, L59, L60, L63, L64, L65, L66, L67, L68, L70, L71, L72, L73, L74, L80, L81, L82, L84, L85, L88, L89, L90, L91,
L92, L93, L94, L95, L97, L98, L99, M00, M02, M05, M06, M07, M08, M10, M11, M12, M13, M14, M15, M16, M17, M18,
M19, M20, M21, M22, M23, M24, M25, M30, M31, M32, M33, M34, M35, M40, M41, M42, M43, M45, M46, M47, M48,
M49, M50, M51, M53, M54, M60, M61, M62, M65, M66, M67, M68, M70, M71, M72, M75, M76, M77, M79, M80, M81,
M82, M83, M84, M85, M86, M87, M88, M89, M90, M91, M92, M93, M94, M95, M96, M99, N00, N02, N03, N04, N05,
N06, N08, N10, N11, N12, N13, N15, N16, N17, N18, N19, N20, N21, N25, N26, N27, N28, N30, N31, N32, N34, N35, N36,
N39, N40, N41, N42, N43, N44, N45, N46, N47, N48, N49, N50, N51, N60, N61, N62, N63, N64, N70, N71, N72, N73, N74,
N75, N76, N80, N81, N82, N83, N84, N85, N86, N87, N88, N89, N90, N91, N92, N93, N94, N95, N96, N97, N98, N99, O00,
O01, O02, O03, O04, O05, O07, O08, O10, O12, O13, O14, O16, O20, O21, O22, O23, O24, O26, O28, O30, O31, O32, O34,
O35, O36, O40, O41, O42, O43, O44, O45, O46, O47, O48, O49, O60, O61, O62, O63, O64, O65, O66, O67, O68, O69, O70,
O71, O72, O73, O74, O75, O80, O81, O82, O83, O84, O85, O86, O87, O88, O89, O90, O91, O92, O98, O99, P01, P02, P05,
P20, P29, P74, P92, Q00, Q03, Q04, Q05, Q06, Q07, Q10, Q11, Q12, Q14, Q15, Q17, Q18, Q20, Q21, Q22, Q23, Q24, Q25,
Q26, Q27, Q28, Q30, Q32, Q35, Q36, Q37, Q39, Q40, Q43, Q51, Q53, Q54, Q61, Q62, Q63, Q64, Q65, Q66, Q67, Q68, Q72,
Q74, Q75, Q77, Q78, Q79, Q81, Q82, Q83, Q84, Q85, Q86, Q87, Q89, Q90, Q91, Q92, Q93, Q95, Q96, Q97, Q98, Q99, R00,
R01, R02, R03, R04, R05, R06, R07, R09, R10, R11, R12, R13, R14, R15, R16, R17, R18, R19, R20, R21, R22, R23, R25,
R26, R27, R29, R30, R31, R32, R33, R34, R35, R39, R40, R41, R42, R43, R44, R45, R46, R47, R48, R49, R50, R51, R52,
R53, R55, R56, R57, R58, R59, R60, R61, R62, R63, R64, R67, R68, R69, R70, R71, R73, R74, R76, R77, R78, R79, R80,
R81, R82, R84, R87, R89, R90, R91, R92, R93, R94, S00, S01, S02, S03, S04, S05, S06, S07, S09, S10, S11, S12, S13, S14,
S15, S19, S20, S21, S22, S23, S24, S25, S27, S29, S30, S31, S32, S33, S34, S36, S37, S38, S39, S40, S41, S42, S43, S44, S45,
S46, S49, S50, S51, S52, S53, S54, S56, S57, S59, S60, S61, S62, S63, S64, S65, S66, S67, S68, S69, S70, S71, S72, S73, S74,
S76, S79, S80, S81, S82, S83, S84, S85, S86, S87, S89, S90, S91, S92, S93, S96, S97, S98, S99, T00, T01, T02, T04, T07,
T08, T09, T10, T11, T12, T13, T14, T15, T16, T17, T18, T19, T20, T21, T22, T23, T24, T25, T26, T27, T28, T29, T30, T31,
T32, T33, T35, T38, T39, T40, T41, T42, T43, T45, T46, T47, T50, T51, T52, T53, T54, T55, T58, T59, T62, T63, T65, T66,
T67, T68, T69, T70, T71, T73, T74, T75, T78, T79, T80, T81, T82, T83, T84, T85, T86, T87, T88, T90, T91, T92, T93, T95,
T98, V03, V10, V11, VRA, VRB, VRK, X60, X61, X62, X63, X64, X69, X70, X71, X78, X81, X82, X83, X84, X91, X95,
X99, Y04, Z00, Z01, Z02, Z03, Z04, Z06, Z07, Z08, Z09, Z10, Z11, Z12, Z13, Z20, Z21, Z22, Z23, Z24, Z25, Z26, Z27, Z29,
Z30, Z31, Z32, Z34, Z35, Z36, Z37, Z38, Z39, Z40, Z41, Z42, Z47, Z48, Z50, Z51, Z52, Z54, Z55, Z56, Z57, Z58, Z59, Z60,
Z61, Z62, Z63, Z64, Z65, Z70, Z71, Z72, Z73, Z74, Z75, Z76, Z80, Z81, Z82, Z83, Z84, Z85, Z86, Z87, Z88, Z89, Z90, Z91,
Z92, Z93, Z94, Z95, Z96, Z97, Z98, Z99
Medications represented as ATC codes
A01*, A01AA, A01AB, A01AC, A01AD, A02AA, A02AD, A02AH, A02BA, A02BB, A02BC, A02BX, A02X, A03AA,
A03AB, A03AX, A03BA, A03BB, A03FA, A04AA, A04AD, A05AA, A05BA, A06AA, A06AB, A06AC, A06AD, A06AG,
A06AH, A06AX, A07AA, A07BA, A07CA, A07DA, A07EA, A07EC, A07FA, A07XA, A08AA, A08AB, A09AA, A10AB,
A10AC, A10AD, A10AE, A10BA, A10BB, A10BD, A10BH, A10BJ, A10BK, A10BX, A11AA, A11AB, A11CA, A11CC,
A11DA, A11E, A11EA, A11EB, A11GA, A11HA, A11JC, A12*, A12A, A12AA, A12AX, A12BA, A12CA, A12CB,
A12CC, A12CE, A12CX, A16AA, A16AX, B01*, B01AA, B01AB, B01AC, B01AD, B01AE, B01AF, B01AX, B02AA,
B02BA, B02BB, B02BC, B02BD, B02BX, B03A, B03AA, B03AB, B03AC, B03AE, B03BA, B03BB, B03XA, B05*,
B05AA, B05BA, B05BB, B05BC, B05CX, B05DA, B05DB, B05XA, B05XC, B06AC, C***, C01AA, C01BC, C01BD,
C01CA, C01CE, C01CX, C01DA, C01DX, C01EB, C02AB, C02AC, C02CA, C02DB, C02DC, C02DD, C02KX, C03AA,
C03AB, C03BA, C03CA, C03DA, C03EA, C03EB, C03XA, C05AA, C05AE, C05BA, C05BB, C05CA, C07AA, C07AB,
C07AG, C07BB, C07CB, C08CA, C08DA, C08DB, C09AA, C09BA, C09CA, C09DA, C09DX, C09XA, C10AA, C10AB,
C10AC, C10AD, C10AX, C10BA, D01AC, D01AE, D01BA, D02AB, D02AC, D02AE, D02AF, D02AX, D03AX, D04AB,
D05AA, D05AX, D05BA, D05BB, D06AA, D06AX, D06BA, D06BB, D06BX, D07AA, D07AB, D07AC, D07AD, D07BC,
D07CA, D07CC, D07XC, D08AB, D08AC, D08AJ, D08AX, D10AB, D10AD, D10AE, D10AF, D10AX, D10BA, D11A,
D11AH, D11AX, G01*, G01AA, G01AF, G02AB, G02AD, G02BA, G02BB, G02CB, G02CX, G03AA, G03AB, G03AC,
G03AD, G03BA, G03CA, G03CX, G03DA, G03DB, G03FA, G03FB, G03GA, G03GB, G03HA, G03HB, G03XB, G03XC,
G04BD, G04BE, G04CA, G04CB, H01AA, H01AB, H01AC, H01AX, H01BA, H01BB, H01CA, H01CB, H01CC, H02AA,
H02AB, H03AA, H03BA, H03BB, H03CA, H04AA, H05AA, H05BA, H05BX, J01AA, J01CA, J01CE, J01CF, J01CR,
J01DB, J01DC, J01DD, J01DH, J01DI, J01EA, J01EB, J01EC, J01EE, J01FA, J01FF, J01GB, J01MA, J01XA, J01XB,
J01XC, J01XD, J01XE, J01XX, J02AA, J02AC, J02AX, J04AB, J04AC, J04AK, J04AM, J04BA, J05AB, J05AD, J05AE,
J05AF, J05AG, J05AH, J05AJ, J05AP, J05AR, J05AX, J06BA, J06BB, J07AE, J07AG, J07AH, J07AJ, J07AL, J07AM,
J07AP, J07BA, J07BB, J07BC, J07BD, J07BF, J07BK, J07BM, L***, L01*, L01AA, L01AB, L01AC, L01AD, L01AX,
L01BA, L01BB, L01BC, L01CA, L01CB, L01CD, L01CE, L01CX, L01DB, L01DC, L01EA, L01EB, L01EE, L01EJ, L01EL,
L01EX, L01XA, L01XB, L01XC, L01XD, L01XE, L01XF, L01XG, L01XK, L01XX, L01XY, L02AB, L02AE, L02BA,
L02BB, L02BG, L02BX, L03AA, L03AB, L03AX, L04*, L04AA, L04AB, L04AC, L04AD, L04AX, M01AA, M01AB,
M01AC, M01AE, M01AG, M01AH, M01AX, M02AA, M03AB, M03AC, M03AX, M03BB, M03BX, M03CA, M04AA,
M04AB, M04AC, M05BA, M05BB, M05BX, N01AF, N01AH, N01AX, N01BA, N01BB, N02AA, N02AB, N02AE, N02AG,
N02AJ, N02AX, N02B, N02BA, N02BE, N02BG, N02CA, N02CC, N02CD, N02CX, N03AA, N03AB, N03AD, N03AE,
N03AF, N03AG, N03AX, N04AA, N04AB, N04BA, N04BB, N04BC, N04BD, N04BX, N05AA, N05AB, N05AD, N05AE,
N05AF, N05AG, N05AH, N05AL, N05AN, N05AX, N05BA, N05BB, N05BE, N05CC, N05CD, N05CF, N05CH, N05CM,
N06AA, N06AB, N06AF, N06AG, N06AX, N06BA, N06BC, N06DA, N06DX, N07AA, N07AX, N07BA, N07BB, N07BC,
N07CA, N07XX, P01AB, P01BA, P01BB, P01BC, P01BE, P01BF, P02CA, P02CF, P02CX, P03AC, R01A, R01AA, R01AC,
R01AD, R01AX, R02AX, R03AC, R03AK, R03AL, R03BA, R03BB, R03CC, R03DA, R03DC, R03DX, R05CB, R05DA,
R06AA, R06AD, R06AE, R06AX, R07AB, R07AX, S01*, S01AA, S01AD, S01AE, S01AX, S01BA, S01BC, S01CA,
S01EA, S01EB, S01EC, S01ED, S01EE, S01FA, S01FB, S01GA, S01GX, S01HA, S01LA, S01XA, S02*, S02AA, S02BA,
S02CA, S02D, S02DC, S03CA, V01AA, V03AB, V03AC, V03AE, V03AF, V03AX, V04CD, V04CH, V04CX, V06*, V06D,
V07A, V07AB, V07AC, V07AY, V08AB, V08CA, V08DA, V09AX, A01, A01AA, A01AB, A01AC, A01AD, A02AA,
A02AD, A02AH, A02BA, A02BB, A02BC, A02BX, A02X, A03AA, A03AB, A03BA, A03BB, A03FA, A04AA, A04AD,
A05AA, A05BA, A06AA, A06AB, A06AC, A06AD, A06AG, A06AH, A06AX, A07AA, A07BA, A07CA, A07DA, A07EA,
A07EC, A07FA, A09AA, A10AB, A10AC, A10AD, A10AE, A10BA, A10BB, A10BD, A10BH, A10BJ, A10BK, A10BX,
A11AA, A11CC, A11DA, A11E, A11EA, A11EB, A11GA, A11HA, A12AA, A12AX, A12BA, A12CA, A12CB, A12CC,
A12CE, A12CX, A16AB, B01, B01AA, B01AB, B01AC, B01AD, B01AE, B01AF, B01AX, B02AA, B02BA, B02BB,
B02BC, B02BD, B03A, B03AA, B03AB, B03AC, B03AE, B03BA, B03BB, B03XA, B05, B05AA, B05BA, B05BB, B05BC,
B05CX, B05DA, B05DB, B05XA, B05XC, C01AA, C01BC, C01BD, C01CA, C01CE, C01CX, C01DA, C01DX, C01EB,
C02AB, C02AC, C02CA, C02DB, C02DC, C02DD, C02KX, C03AA, C03AB, C03BA, C03CA, C03DA, C03EA, C03XA,
C05AA, C05BA, C05BB, C05CA, C07AA, C07AB, C07AG, C08CA, C08DA, C08DB, C09AA, C09BA, C09CA, C09DA,
C09DX, C09XA, C10AA, C10AB, C10AC, C10AD, C10AX, D01AC, D01AE, D01BA, D02AB, D02AC, D02AX, D05AA,
D05AX, D05BA, D05BB, D06AA, D06AX, D06BA, D06BB, D06BX, D07AA, D07AB, D07AC, D07AD, D07BC, D07CA,
D07CC, D07XC, D08AJ, D10AD, D10AE, D10AF, D10AX, D11AH, D11AX, G01AF, G02AB, G02AD, G02BA, G02CB,
G02CX, G03AA, G03AB, G03AC, G03BA, G03CA, G03CX, G03DA, G03FA, G03HA, G03XB, G03XC, G04BD, G04BE,
G04CA, G04CB, H01AA, H01AC, H01BA, H01BB, H01CB, H02AA, H02AB, H03AA, H03BA, H03BB, H04AA, H05AA,
H05BA, H05BX, J01AA, J01CA, J01CE, J01CF, J01CR, J01DB, J01DC, J01DD, J01DH, J01EA, J01EB, J01EE, J01FA,
J01FF, J01GB, J01MA, J01XA, J01XB, J01XC, J01XD, J01XE, J01XX, J02AA, J02AC, J02AX, J04AB, J04AC, J04AK,
J04AM, J05AB, J05AF, J05AH, J05AJ, J05AR, J05AX, J06BA, J06BB, J07AG, J07AJ, J07AL, J07AM, J07BB, J07BC,
J07BF, L, L01, L01AA, L01AC, L01AD, L01BA, L01BB, L01BC, L01CA, L01CB, L01CD, L01CE, L01CX, L01DB,
L01DC, L01EB, L01EJ, L01EL, L01EX, L01XA, L01XB, L01XC, L01XD, L01XE, L01XF, L01XG, L01XX, L01XY,
L02AE, L02BA, L02BB, L02BG, L02BX, L03AA, L03AB, L03AX, L04AA, L04AB, L04AC, L04AD, L04AX, M01AA,
M01AB, M01AE, M01AH, M01AX, M02AA, M03AB, M03AC, M03AX, M03BB, M03BX, M03CA, M04AA, M04AB,
M04AC, M05BA, M05BX, N01AF, N01AH, N01AX, N01BB, N02AA, N02AB, N02AE, N02AG, N02AJ, N02AX, N02B,
N02BA, N02BE, N02BG, N02CC, N02CD, N02CX, N03AA, N03AB, N03AE, N03AF, N03AG, N03AX, N04AA, N04AB,
N04BA, N04BB, N04BC, N04BD, N04BX, N05AA, N05AB, N05AD, N05AE, N05AF, N05AG, N05AH, N05AL, N05AN,
N05AX, N05BA, N05BB, N05BE, N05CC, N05CD, N05CF, N05CH, N05CM, N06AA, N06AB, N06AG, N06AX, N06BA,
N06BC, N06DA, N06DX, N07AA, N07AX, N07BA, N07BB, N07BC, N07CA, N07XX, P01AB, P01BA, P01BC, R01A,
R01AA, R01AC, R01AD, R01AX, R03AC, R03AK, R03AL, R03BA, R03BB, R03CC, R03DA, R03DC, R03DX, R05CB,
R05DA, R06AA, R06AD, R06AE, R06AX, R07AB, R07AX, S01, S01AA, S01AX, S01BA, S01BC, S01CA, S01EA, S01EB,
S01EC, S01ED, S01EE, S01FA, S01FB, S01GA, S01GX, S01HA, S01LA, S01XA, S02AA, S02CA, S03CA, V03AB,
V03AC, V03AE, V03AF, V04CH, V04CX, V06, V07A, V07AB, V07AC, V07AY, V08AB, V08CA, V08DA, V09AX
Laboratory tests represented as acronyms and local variable names
25-Hydroxy-Vitamin D2;P, 25-Hydroxy-Vitamin D3;P, 25-OH-vitamin D (D3+D2);P, 3,4-Methylendioxyamfetamin;U, 3-
Hydroxybutyrat;P, AFP, AGASBASE, AGASCO2, AGASHCO2, AGASLAKTAT, AGASO2, AGASPH, AGASSAT, ALB,
ALP, ALT, AMYL, ANA, AV peak gradient, Acetoacetat (semikvant);U, Albumin / Kreatinin-ratio;U, Albumin;Csv,
Albumin;Plv, Albumin;U, Amfetamin+analog;U, Ammonium;P, Amylase, pancreastype;P, Anion gap (inkl. K+);P,
Antithrombin (enz.);P, Antitrombin (enz.);P, Antitrombin (koag.);P, Antitrombin;P, Ao V max, Ao V2 VTI, Ao V2 max PG,
Ao V2 max vel, Aspartattransaminase [ASAT];P, Aspergillus (galactomannan Ag);P, Autofortolkning, B2M, BAC-test;B,
BAS, BCx, BF-test udløb;P, BIL, BLYM, BSA, BUN, Bacterium+fungus;B(kateter; hæmodial.), Bacterium, nitrit-prod.
(semikvant);U, Basisk fosfatase;P, Benzodiazepiner;U, Benzoylecgonin;U, Blastceller (uspec.);B, Bloddyrkning
(Bakterium+fungus);aB, Bloddyrkning (fungus);B(CVK), Bloddyrkning (fungus);aB, Blodtype (AB0; Rh D);Erc(B), Brain
natriuretisk peptid [BNP];P, Buprenorphin;U, C3, CARDIOLIPINIGG, CARDIOLIPINIGM, CD3, CD4, CD8, CHD-4-IgG
[Mi-2];P, CICLO, CK, CKMB, CMVIGG, CMVPCR, CO2 total;P(vB), CORONAVIRUS SARS-COV-2 TOTAL IG, CRE,
CRP, Calcium (albuminkorrigeret);P, Calcium-ion (frit)(pH=7,40; kont.renal erstat.terapi);P(vB;efter filter), Calcium-ion frit
(pH = 7,40);P(vB;efter filter), Calcium-ion frit;P, Calcium-ion frit;P(vB;efter filter), Calcium;P, Candida mannan (Ag);P,
Candida mannan-Ab;P, Candida-relateret egenskab gruppe;P, Cannabinol;U, Carbonmonoxidhæmoglobin;Hb(B), Centromer-
IgG;P, Cerebrospinalvæske gruppe;Csv, Combat-COVID19-24T, Combat-COVID19-Start, Cystatin C;P, Cytomegalovirus-
Ab;P, DDIM, DIFFBERE, DNA topoisomerase1-IgG [Scl70];P, DNAIGG, Deoxyhæmoglobin;Hb(tot.;aB), Digoxin;P, Direkte
antiglobulin gruppe;Erc(B), Diverse analyse til KBA, EBNA, EBVPCR, EGFR, EOS, EVOL, EXOSC10-IgG [PM-Scl100];P,
Eosinofilocytter;Csv, Erythrocyt-Ab gruppe;P, Erythrocytter (semikvant.);U, Erytroblaster;B, Erytrocytter;B, Erytrocytter;Csv,
Erytrocytvol. rel. spredning;Erc(B), Erytrocytvolumen (middel) [MCV];B, Ethanol;P, FER, FIBR, Fibrillarin-IgG;P, Folat;P,
Fosfat;P, GLUC, Gentamicin;P, Glomerulær basalmembran-IgG;P, Glukose (semikvant);U, Glukose;Csv, Glukose;P(aB),
Glukose;P(kB), HAEM, HAPTO, HAVAB, HBVABC, HBVABS, HBVAGS, HBVIGM, HCG, HCVIGG, HDL, HGH
CORONA - PAKKE, HIV 1+2 (Ag+Ab);P, Heparin lav molmasse (enz.);P, Histidin-tRNA-ligase (Jo1)-IgG;P,
Hydrogencarbonat (akt.;Pt-tp);P(aB), Hydrogencarbonat;P(aB), Hydrogenkarbonat (aktuel);P(vB), Hæmoglobin (frit);P,
Hæmoglobin (semikvant);U, Hæmoglobin A1c (IFCC);Hb(B), Hæmoglobin [MCHC];Erc(B), Hæmoglobinindhold
[MCH];Erc(B), Hæmoglobinindhold;Rtcs(B), IGA, IGG, IGM, INR, IVS, Insulin;P(fPt), Interleukin-6;P, Jern;P, K+, KIA
PROIL:TruCulture forskning;B, KIA profil: Multiplate, KIA profil: T-/B-/NK-lymfocytter;B, KOL, Kalium;Pt(U), Kalium;U,
Kappa / Lambda-kæde (Ig) frit;P, Kappa-kæde (Ig) frit;P, Karbamid;Pt(U), Karbamid;U, Kerneholdige celler (uspec.);Csv,
Kerneholdige celler;Csv, Kerneholdige celler;Plv, Keton;B, Klarhed før centrifugering;Csv, Klorid;P, Koag. TF-induceret
tid;B, Koag. heparin-uafh tid;B, Koag. overflade-induceret (APTT)(BFH);P, Koag. overflade-induceret [APTT];P, Koag.
overflade-induceret tid;B, Koagel-forskydningsstyrke;B, Koageldannelse, TF-induc.;B, Koagellyse, TF-induc.;B,
Koagelstyrke, TF-induc.;B, Koagelstyrke, trombocyt-uafh;B, Koagulationsfaktor II+VII+X;P, Kolesterol LDL (beregnet);P,
Kolesterol VLDL;P, Kolesterol non-HDL;P, Kortisol;P, Kreatinin-clearance;Nyre, Kreatinin;Pt(U), Kreatinin;U, LAC, LDH,
LDL, LEUK, LV Mass Index, LV V1 max PG, LV mean PG, LVIDD, LVIDD index, LVIDS, LVOT diam, LVOT peak VTI,
LYM, Laktat;Csv, Laktat;P, Laktat;P(vB), Laktatdehydrogenase;Plv, Lambda-kæde (Ig) frit;P, Leukoblaster;B, Leukocytter
(mononukl.);Csv, Leukocytter (mononukl.);Plv, Leukocytter (polynukl.);Csv, Leukocytter (polynukl.);Plv, Leukocytter
(semikvant);U, Leukocytter (uspec.);B, Leukocyttype gruppe;B, Levetiracetam;P, Lipase;P, Lithium;P,
Lymfocytter+plasmaceller+blaster;B, Lymfocytter;Csv, M-komponent;P, MON, MV E/A, MV dec time, Magnesium;P, Major
centromere B-IgG;P, Makrofager+monocytter;Csv, Markørundersøgelse, immundefekt projekt,
Metamyelo.+Myelo.+Promyelocytter;B, Metamyelo.+myelo.+promyelocytter;B, Metamyelocytter;B, Methadon;U,
Methæmoglobin;Hb(B), Mitose spindel-IgG;P, Monocytter+blaster;B, Morphin+analog;U, Multiplate-ADP;Trc(B), Multiplate-
ASPI;Trc(B), Multiplate-TRAP(max);TRC(B), Myelocytter;B, Myoglobin;P, NA+, NEU, NK, Natrium;Pt(U), Natrium;U,
Neutrofilocytter;Csv, Neutrophilocytter;B, Nitrit (semikvant);U, Nitrit;U, Nøgne kerner;B, O2 sat.;Hb(B), O2
sat.;Hb(aB;pulm.), O2 sat.;Hb(cvB), O2 sat.;Hb(kB), O2 sat.;Hb(vB), O2 sat;Hb(vB;pulm.), O2-flow;Pt,
ORDERGROUP_EBV, Osmolalitet;P, Osmolalitet;U, Oxyhæmoglobin;Hb(aB;pulm.), Oxyhæmoglobin;Hb(tot.; aB),
Oxyhæmoglobin;Hb(tot.; aB), Oxyhæmoglobin;Hb(tot.;aB), Oxyhæmoglobin;Hb(tot.;aB), Oxyhæmoglobin;Hb(tot.; vB),
Oxyhæmoglobin;Hb(tot.; vB), Oxyhæmoglobin;Hb(tot.;vB), Oxyhæmoglobin;Hb(tot.;vB), Oxyhæmoglobin;Hb(tot.;cvB), P
akse, P taks varighed, PR interval, PROCAL, PV peak gradient, PW, Paracetamol;P, Parathyrin [PTH];P, Plasmocytter;B,
Pleuravæske gruppe;Plv, Pro-brain natriuretisk pept. [proBNP];P, Proinsulin C-peptid;P, Proinsulin C-peptid;P(fPt),
Prolifererende nucleus-IgG;P, Promyelocytter;B, Prostataspecifikt antigen;P, Protein (semikvant);U, Protein;Csv, Protein;P,
Protein;Plv, Protein;Pt(U), Protein;U, Proteinase 3-IgG [PR3];P, Prøvemateriale, QRS interval, QT interval, QTc (Bassett’s
formel), QTc (Fridericia’s formel), R akse, RNA pol III RPC1-IgG;P, RR interval, RSJ SÆRAFTALE 00461, Reticulocytter
gruppe;B, Reticulocytter;B, Reticulocytter;Erc(B), Ribosomal protein-IgG [Rib P];P, Rotem, Sedimentationsreaktion;B,
Sjøgren syndrom [SSA]-IgG;P, Sjøgren syndrom [SSB]-IgG;P, Smiths-IgG;P, Smudge celler;Csv, Store ufarvede celler;B,
Syrebasestatus gruppe;Pt, Syrebasestatus gruppe;Pt(aB), Syrebasestatus gruppe;Pt(vB), T akse, T-lymphocyt
(helper/cytotox);Lymc(B), TEG, TEG FF-MA, TEG FFH-MA, TEG-Angle, TEG-LY30, TEG-MA, TEG-R, TEG-hep-
Angle;B, TEG-hep-LY30;B, TEG-hep-MA;B, TEG-hep-R, TEG-heparinase, TR Max Vel, TR max PG, Tacrolimus;B,
Thyrotropin [TSH]-reflextest;P, Thyrotropin [TSH];P, Thyroxin [T4];P, Thyroxin frit [T4];P, Transferrin-mætning;P,
Transferrin;P, Triglycerid;P, Triiodthyronin [T3];P, Triiodthyronin frit [T3];P, Trombocytter;B, Troponin I;P, Troponin T;P,
Troponin;P, U-PAR;P, U1 snRNP (70 kDa+A+C)-IgG;P, Urat;P, Urinopsamlingstid;Pt, Urinundersøgelse stix gruppe;U,
Vancomycin;P, Ventrikelfrekvens, Vitamin B12;P, Volumen;Pt(U), Voriconazol;P, Zink;P, eAG, gamma-Glutamyltransferase
[GGT];P, pCO2;P, pCO2;P(aB;pulm.), pCO2;P(cvB), pCO2;P(kB), pCO2;P(vB), pCO2;P(vB;pulm.), pH;P, pH;P(aB;pulm.),
pH;P(cvB), pH;P(kB), pH;P(vB), pH;U, pO2 (halvmætn.);Hb(B), pO2;P, pO2;P(aB;pulm.), pO2;P(cvB), pO2;P(kB),
pO2;P(vB), pO2;P(vB;pulm.), suPAR;P
Vital parameters
Blood pressure (diastolic), blood pressure (systolic), Glasgow Coma Scale (GCS), Early Warning Score total, oxygen supply
(L/min.), Pulse, Respiratory frequency (per minute), Oxygen Saturation, Temperature, Body Mass Index.
Demographics
Sex, Age, Pandemic wave, Pandemic week
Hospitalizations
Admitted at the time of first positive test, Previous admissions in the last 3 years, Cumulative days in hospital in the last 3 years
Summary features
Number of diagnoses, Number of ordered medicines, Number of administered medicines
Supplementary Table 2. Full set of features before feature selection
Feature names and codes for the 2723 initial features used as input to feature selection. Feature values were
encoded using different time windows prior to the first positive test (FPT) and summary metrics chosen according to
the data type, as detailed in the text and Figure 1. For vital parameters and laboratory tests, the last value within one
month before FPT was used. For diagnoses and medications, the total count of assigned codes within the last 3 years
and 1 year before FPT, respectively, was encoded. Hospitalizations within the last 3 years before FPT were considered.
Summary features were generated by counting the total number of diagnosis and medicine codes assigned to a patient.
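The window-based encoding described in this caption can be sketched as follows. This is a minimal illustration, not the thesis pipeline: the function names, the feature-name prefixes, the toy records, the use of 30 days for "one month" and 365 days per year, and the exclusion of the FPT day itself are all assumptions.

```python
from collections import Counter
from datetime import date, timedelta


def last_value_within(records, fpt, days=30):
    """Most recent value among (date, value) records in the `days` before FPT."""
    start = fpt - timedelta(days=days)
    window = [(d, v) for d, v in records if start <= d < fpt]
    return max(window, key=lambda r: r[0])[1] if window else None


def code_counts_within(events, fpt, years):
    """Number of times each code was assigned in the `years` before FPT."""
    start = fpt - timedelta(days=365 * years)
    return Counter(code for d, code in events if start <= d < fpt)


# Hypothetical records for one patient.
fpt = date(2020, 11, 1)
labs_lym = [(date(2020, 9, 15), 1.9), (date(2020, 10, 20), 0.8), (date(2020, 10, 28), 0.6)]
diagnoses = [(date(2018, 5, 1), "I10"), (date(2019, 3, 1), "I10"), (date(2016, 1, 1), "G30")]
medications = [(date(2020, 6, 1), "N02BE"), (date(2019, 6, 1), "N02BE")]

diag_counts = code_counts_within(diagnoses, fpt, years=3)   # 3-year window
med_counts = code_counts_within(medications, fpt, years=1)  # 1-year window

features = {
    "lab_last_LYM": last_value_within(labs_lym, fpt),  # None when no value in window
    "diag_count_I10": diag_counts["I10"],
    "diag_count_G30": diag_counts["G30"],
    "med_count_N02BE": med_counts["N02BE"],
}
print(features)
```

Records outside the window contribute nothing: a laboratory test with no measurement in the month before FPT is left missing, and a code with no assignment in its window gets a count of zero.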
Parameter | Value
boosting_type | gbdt
colsample_bytree | 1
importance_type | split
learning_rate | 0.05
max_depth | -1
min_child_samples | 20
min_child_weight | 0.001
min_split_gain | 0
n_estimators | 100
n_jobs | -1
num_leaves | 31
objective | binary
reg_alpha | 0
reg_lambda | 0
silent | TRUE
subsample | 0.7
subsample_for_bin | 200000
subsample_freq | 0
num_iterations | 50
scale_pos_weight | 100
metric | auc
seed | 1234
Supplementary Table 3. Hyperparameters for LightGBM classifiers.
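For reuse, the hyperparameters in this table can be written as a Python dictionary. The names `lgbm_params`, `alias_groups` and `conflicting_aliases` are hypothetical, and the source does not show how the values were passed to LightGBM. The small check reflects one fact from the LightGBM documentation: `n_estimators` is an alias of `num_iterations`, so the table gives two values for the same setting.

```python
# Values transcribed from Supplementary Table 3.
lgbm_params = {
    "boosting_type": "gbdt",
    "colsample_bytree": 1,
    "importance_type": "split",
    "learning_rate": 0.05,
    "max_depth": -1,
    "min_child_samples": 20,
    "min_child_weight": 0.001,
    "min_split_gain": 0,
    "n_estimators": 100,
    "n_jobs": -1,
    "num_leaves": 31,
    "objective": "binary",
    "reg_alpha": 0,
    "reg_lambda": 0,
    "silent": True,  # accepted by older LightGBM releases; may need removing in newer ones
    "subsample": 0.7,
    "subsample_for_bin": 200000,
    "subsample_freq": 0,
    "num_iterations": 50,
    "scale_pos_weight": 100,
    "metric": "auc",
    "seed": 1234,
}

# Alias group considered here (only the one relevant to this table).
alias_groups = {"num_iterations": ("num_iterations", "n_estimators")}


def conflicting_aliases(params, groups):
    """Return alias groups for which the params hold more than one distinct value."""
    conflicts = {}
    for canonical, names in groups.items():
        present = {name: params[name] for name in names if name in params}
        if len(set(present.values())) > 1:
            conflicts[canonical] = present
    return conflicts


conflicts = conflicting_aliases(lgbm_params, alias_groups)
print(conflicts)
```

The table does not say which of the two values (50 or 100 boosting rounds) took effect. That depends on the API used and the LightGBM version, so anyone reproducing the classifiers may want to set only one of them explicitly.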
Feature name | Precision | Sensitivity | Specificity | ROC-AUC | MCC | PRAUC | C-index
Baseline (no missing values added) | 0.393 | 0.993 | 0.864 | 0.970 | 0.580 | 0.686 | 0.946
Age | 0.676 | 0.270 | 0.989 | 0.899 | 0.398 | 0.533 | 0.859
Number of ordered medicines | 0.420 | 0.905 | 0.889 | 0.956 | 0.571 | 0.607 | 0.930
Admitted at the time of first positive test | 0.407 | 0.966 | 0.875 | 0.968 | 0.582 | 0.670 | 0.943
Body Mass Index | 0.396 | 0.971 | 0.868 | 0.965 | 0.574 | 0.640 | 0.943
Diagnose, count, Z01: Encounter for other special examination | 0.387 | 0.993 | 0.860 | 0.967 | 0.574 | 0.655 | 0.941
Pandemic week | 0.383 | 0.993 | 0.858 | 0.967 | 0.570 | 0.654 | 0.940
Sex | 0.398 | 0.993 | 0.866 | 0.969 | 0.584 | 0.682 | 0.945
Ordered Medicine, count, A06AD: Laxatives | 0.398 | 0.993 | 0.867 | 0.970 | 0.584 | 0.689 | 0.945
Number of diagnoses | 0.385 | 0.989 | 0.860 | 0.967 | 0.571 | 0.651 | 0.944
Ordered Medicine, count, N02BE: Paracetamol | 0.392 | 0.991 | 0.863 | 0.969 | 0.578 | 0.666 | 0.945
laboratory, last value, LYM: Absolute Lymphocyte count | 0.393 | 0.982 | 0.865 | 0.967 | 0.575 | 0.669 | 0.943
Ordered Medicine, count, C03CA: Loop diuretics | 0.392 | 0.987 | 0.864 | 0.969 | 0.577 | 0.671 | 0.944
Cumulative days in hospital within the last 3 years | 0.395 | 0.989 | 0.865 | 0.969 | 0.580 | 0.675 | 0.945
Ordered Medicine, count, N01AH: Opioid anesthetics | 0.393 | 0.993 | 0.864 | 0.969 | 0.580 | 0.667 | 0.944
Previous admissions in the last 3 years | 0.393 | 0.991 | 0.864 | 0.970 | 0.579 | 0.679 | 0.945
Diagnose, count, G30: Alzheimer's disease | 0.393 | 0.991 | 0.864 | 0.970 | 0.579 | 0.683 | 0.945
Diagnose, count, Z03: Encounter for medical observation | 0.394 | 0.991 | 0.864 | 0.970 | 0.580 | 0.681 | 0.945
Diagnose, count, I10: Essential hypertension | 0.392 | 0.993 | 0.863 | 0.969 | 0.579 | 0.676 | 0.945
Administered medicine, count, C03CA: Loop diuretics | 0.394 | 0.993 | 0.864 | 0.970 | 0.580 | 0.675 | 0.945
Ordered Medicine, count, A11EA: Vitamin B-complex | 0.393 | 0.989 | 0.865 | 0.970 | 0.579 | 0.687 | 0.945
Administered medicine, count, A12AX: Calcium + vitamin D | 0.394 | 0.993 | 0.864 | 0.970 | 0.580 | 0.680 | 0.945
Supplementary Table 4. Performance metrics on the test set under missing information.
Performance metrics at week 12 were calculated on the test set by setting, one feature at a time, all values of that
feature in the final model to missing. The lowest performances are highlighted in bold characters.
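The procedure in this caption, blanking one feature at a time and re-scoring the test set, can be sketched as below. Everything here is a toy illustration under stated assumptions: `toy_predict` is a hypothetical stand-in for the trained model (it simply tolerates missing values), the rows and labels are invented, and sensitivity is used as the only metric.

```python
def ablate_feature(rows, feature):
    """Copy of `rows` with every value of `feature` set to missing (None)."""
    return [{**row, feature: None} for row in rows]


def sensitivity(y_true, y_pred):
    """Fraction of true positives among all actual positives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn) if (tp + fn) else float("nan")


def toy_predict(row):
    """Hypothetical model: a missing value contributes nothing to the score."""
    score = 0.0
    if row["age"] is not None and row["age"] >= 70:
        score += 1.0
    if row["n_medicines"] is not None and row["n_medicines"] >= 5:
        score += 0.5
    return 1 if score >= 1.0 else 0


def missingness_report(rows, y_true, features, predict, metric):
    """Metric on the intact data, then with each feature blanked in turn."""
    report = {"baseline": metric(y_true, [predict(r) for r in rows])}
    for feature in features:
        ablated = ablate_feature(rows, feature)
        report[feature] = metric(y_true, [predict(r) for r in ablated])
    return report


# Invented test set: two patients who died (1) and two who survived (0).
rows = [
    {"age": 82, "n_medicines": 7},
    {"age": 75, "n_medicines": 1},
    {"age": 45, "n_medicines": 6},
    {"age": 30, "n_medicines": 0},
]
y_true = [1, 1, 0, 0]

report = missingness_report(rows, y_true, ["age", "n_medicines"], toy_predict, sensitivity)
print(report)
```

In this toy setting, blanking `age` collapses sensitivity while blanking `n_medicines` leaves it unchanged. That is the same kind of contrast the table shows between Age, where sensitivity falls from 0.993 to 0.270, and most other features.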
ResearchGate has not been able to resolve any citations for this publication.
Preprint
Full-text available
Interpretable risk assessment of SARS-CoV-2 positive patients can aid clinicians to implement precision medicine. Here we trained a machine learning model to predict mortality within 12 weeks of a first positive SARS-CoV-2 test. By leveraging data on 33,928 confirmed SARS-CoV-2 cases in eastern Denmark, we considered 2,723 variables extracted from electronic health records (EHR) including demographics, diagnoses, medications, laboratory test results and vital parameters. A discrete-time framework for survival modelling enabled us to predict personalized survival curves and explain individual risk factors. Performances of weighted concordance index 0.95 and precision-recall area under the curve 0.71 were measured on the test set. Age, sex, number of medications, previous hospitalizations and lymphocyte counts were identified as top mortality risk factors. Our explainable survival model developed on EHR data also revealed temporal dynamics of the 22 selected risk factors. Upon further validation, this model may allow direct reporting of personalized survival probabilities in routine care.
Article
Full-text available
Background There is limited data on outcomes of moderate to severe Coronavirus disease 2019 (COVID-19) among patients treated with remdesivir and dexamethasone in a real-world setting. Objective To compare the effectiveness of standard of care (SOC) alone vs SOC plus remdesivir and dexamethasone. Methods Two population-based nationwide cohorts of individuals hospitalized with COVID-19 during February through December 2020. Death within 30 days and need of mechanical ventilation (MV) were compared by inverse probability of treatment weighted (ITPW) logistic regression analysis and shown as odds ratio (OR) with 95% confidence interval (CI). Results The 30-d mortality rate of 1694 individuals treated with remdesivir and dexamethasone in addition to SOC was 12.6% compared to 19.7% for 1053 individuals receiving SOC alone. This corresponded to a weighted OR of 30-day mortality of 0.47 (95% CI, 0.38-0.57) for patients treated with remdesivir and dexamethasone compared to patients receiving SOC alone. Similarly, progression to MV was reduced (OR 0.36 (95% CI, 0.29-0.46)). Conclusions and relevance Treatment of moderate to severe COVID-19 during June through December that included remdesivir and dexamethasone was associated with reduced 30-day mortality and need of MV compared to treatment in February through May.
Article
Full-text available
Artificial intelligence (AI) represents a valuable tool that could be widely used to inform clinical and public health decision-making to effectively manage the impacts of a pandemic. The objective of this scoping review was to identify the key use cases for involving AI for pandemic preparedness and response from the peer-reviewed, preprint, and grey literature. The data synthesis had two parts: an in-depth review of studies that leveraged machine learning (ML) techniques and a limited review of studies that applied traditional modeling approaches. ML applications from the in-depth review were categorized into use cases related to public health and clinical practice, and narratively synthesized. One hundred eighty-three articles met the inclusion criteria for the in-depth review. Six key use cases were identified: forecasting infectious disease dynamics and effects of interventions; surveillance and outbreak detection; real-time monitoring of adherence to public health recommendations; real-time detection of influenza-like illness; triage and timely diagnosis of infections; and prognosis of illness and response to treatment. Data sources and types of ML that were useful varied by use case. The search identified 1167 articles that reported on traditional modeling approaches, which highlighted additional areas where ML could be leveraged for improving the accuracy of estimations or projections. Important ML-based solutions have been developed in response to pandemics, and particularly for COVID-19 but few were optimized for practical application early in the pandemic. These findings can support policymakers, clinicians, and other stakeholders in prioritizing research and development to support operationalization of AI for future pandemics.
Article
Full-text available
Machine learning methods offer great promise for fast and accurate detection and prognostication of coronavirus disease 2019 (COVID-19) from standard-of-care chest radiographs (CXR) and chest computed tomography (CT) images. Many articles have been published in 2020 describing new machine learning-based models for both of these tasks, but it is unclear which are of potential clinical utility. In this systematic review, we consider all published papers and preprints, for the period from 1 January 2020 to 3 October 2020, which describe new machine learning models for the diagnosis or prognosis of COVID-19 from CXR or CT images. All manuscripts uploaded to bioRxiv, medRxiv and arXiv along with all entries in EMBASE and MEDLINE in this timeframe are considered. Our search identified 2,212 studies, of which 415 were included after initial screening and, after quality screening, 62 studies were included in this systematic review. Our review finds that none of the models identified are of potential clinical use due to methodological flaws and/or underlying biases. This is a major weakness, given the urgency with which validated COVID-19 models are needed. To address this, we give many recommendations which, if followed, will solve these issues and lead to higher-quality model development and well-documented manuscripts.
Article
Background Many interventional in-patient COVID-19 trials assess primary outcomes through day 28 post-randomization. Since a proportion of patients experience protracted disease or relapse, such follow-up period may not fully capture the course of the disease, even when randomization occurs a few days after hospitalization. Methods Among adults hospitalized with COVID-19 in Eastern Denmark from March 18, 2020 - January 12, 2021 we assessed: all-cause mortality, recovery and sustained recovery 90 days after admission, and readmission and all-cause mortality 90 days after discharge. Recovery was defined as hospital discharge and sustained recovery as recovery and alive without readmissions for 14 consecutive days. Results Among 3,386 patients included in the study 2,796 (82.6%) reached recovery and 2,600 (77.0%) achieved sustained recovery. Of those discharged from hospital, 556 (19.9%) were readmitted, and 289 (10.3%) died. Overall, the median time to recovery was 6 days (Interquartile range (IQR), 3-10), and 19 days (IQR, 11-33) among patients in intensive care in the first two days of admission. Conclusions Post-discharge readmission and mortality rates were substantial. Therefore, sustained recovery should be favored to recovery outcomes in clinical COVID-19 trials. A 28-day follow-up period may be too short the critically ill.
Article
In the past decade, the application of machine learning (ML) to healthcare has helped drive the automation of physician tasks as well as enhancements in clinical capabilities and access to care. This progress has emphasized that, from model development to model deployment, data play central roles. In this Review, we provide a data-centric view of the innovations and challenges that are defining ML for healthcare. We discuss deep generative models and federated learning as strategies to augment datasets for improved model performance, as well as the use of the more recent transformer models for handling larger datasets and enhancing the modelling of clinical text. We also discuss data-focused problems in the deployment of ML, emphasizing the need to efficiently deliver data to ML models for timely clinical predictions and to account for natural data shifts that can deteriorate model performance. This Review discusses the use of deep generative models, federated learning and transformer models to address challenges in the deployment of machine learning for healthcare.
Preprint
Human Leucocyte Antigen (HLA) class I alleles are the main host genetic factors involved in controlling HIV-1 viral load (VL). Nevertheless, HLA diversity has proven a significant challenge in association studies. We assessed how accounting for binding affinities of HLA class I alleles to HIV-1 peptides facilitate association testing of HLA with HIV-1 VL in a heterogeneous cohort from the Strategic Timing of AntiRetroviral Treatment (START) study. We imputed HLA class I alleles from host genetic data (2,546 HIV+ participants) and sampled immunopeptidomes from 2,079 host-paired viral genomes (targeted amplicon sequencing). We predicted HLA class I binding affinities to HIV-1 and unspecific peptides, grouping alleles into functional clusters through consensus clustering. These functional HLA class I clusters were used to test associations with HIV VL. We identified four clades totalling 30 HLA alleles accounting for 11.4% variability in VL. We highlight HLA-B*57:01 and B*57:03 as functionally similar but yet overrepresented in distinct ethnic groups, showing when combined a protective association with HIV+ VL (log, β −0.25; adj. p-value < 0.05). We further demonstrate only a slight power reduction when using unspecific immunopeptidomes, facilitating the use of the inferred functional HLA groups in other studies. The outlined computational approach provides a robust and efficient way to incorporate HLA function and peptide diversity, aiding clinical association studies in heterogeneous cohorts. To facilitate access to the proposed methods and results we provide an interactive application for exploring data. Abstract Figure
Article
Objectives: mRNA COVID-19 vaccines have shown high effectiveness in the prevention of symptomatic COVID-19, hospitalization, severe disease, and death. Nevertheless, a minority of vaccinated individuals might get infected and suffer significant morbidity. Characteristics of vaccine breakthrough infections have not been studied. We sought to portray the population of Israeli patients who were hospitalized with COVID-19 despite full vaccination. Methods: A retrospective multicenter cohort study of 17 hospitals included Pfizer/BioNTech's BNT162b2 fully-vaccinated patients who developed COVID-19 more than 7 days after the second vaccine dose and required hospitalization. The risk for poor outcome, defined as a composite of mechanical ventilation or death, was assessed. Results: 152 patients were included, accounting for half of hospitalized fully-vaccinated patients in Israel. Poor outcome was noted in 38 patients and mortality rate reached 22% (34/152). Notably, the cohort was characterized by a high rate of comorbidities predisposing to severe COVID-19, including hypertension (108, 71%), diabetes (73, 48%), CHF (41, 27%), chronic kidney and lung diseases (37, 24% each), dementia (29, 19%), and cancer (36, 24%), and only 6 (%) had no comorbidities. Sixty (40%) of the patients were immunocompromised. Higher SARS-CoV-2 viral load was associated with a significant risk for poor outcome. Risk also appeared higher in patients receiving anti-CD20 treatment and in patients with low titers of anti-spike IgG, but these differences did not reach statistical significance. Conclusions: We found that severe COVID-19 infection, associated with a high mortality rate, might develop in a minority of fully-vaccinated individuals with multiple comorbidities. Our patients had a higher rate of comorbidities and immunosuppression compared to previously reported non-vaccinated hospitalized COVID-19 patients. Further characterization of this vulnerable population may help to develop guidance to augment their protection, either by continued social distancing or by additional active or passive vaccinations.
Article
MHC class I (MHC‐I) molecules undergo an intricate folding process in order to pick up antigenic peptides to present to the immune system. In recent years, the discovery of a new peptide editor for MHC‐I has added an extra level of complexity to our understanding of how peptide presentation is regulated. On top of this, the incredible diversity in MHC‐I molecules leads to significant variation in the interaction between MHC‐I and components of the antigen processing and presentation pathway. Here, we review our current understanding of how polymorphisms in human leukocyte antigen class I molecules influence their interactions with key components of the antigen processing and presentation pathway. A deeper understanding of this may offer new insights into how apparently subtle variation in MHC‐I can have a significant impact on susceptibility to disease.
Article
Background: Understanding the genetic interplay between human hosts and infectious pathogens is crucial for how we interpret virulence factors. Here, we tested for associations between HIV and host genetics, and interactive genetic effects on viral load (VL) in HIV+ ART-naive clinical trial participants. Methods: HIV genomes were sequenced and the encoded amino acid (AA) variants were associated with VL, human single nucleotide polymorphisms (SNPs) and imputed HLA alleles, using generalized linear models with Bonferroni correction. Results: Human (388,501 SNPs) and HIV (3,010 variants) genetic data were available for 2,122 persons. Four HIV variants were associated with VL (p-values < 1.66×10⁻⁵). Twelve HIV variants were associated with a range of 1–512 human SNPs (p-value < 4.28×10⁻¹¹). We found 46 associations between HLA alleles and HIV variants (p-values < 1.29×10⁻⁷). We found that HIV variants and immunotypes, when analyzed separately, were associated with lower VL, whereas the opposite was true when analyzed in concert. Epitope binding prediction showed HLA alleles to be weaker binders of associated HIV AA variants relative to alternative variants at the same position. Conclusions: Our results show the importance of immunotype specificity on viral antigenic determinants, and the identified genetic interplay emphasizes that viral and human genetics should be studied in the context of each other.
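The first two significance thresholds quoted in the abstract above are consistent with a Bonferroni correction in which a family-wise error rate is divided by the number of tests. The following minimal sketch reproduces that arithmetic. The error rate of 0.05 is an assumption (the abstract does not state it), and the function and variable names are illustrative only; the variant and SNP counts are taken from the abstract.

```python
# Assumed family-wise error rate; not stated in the abstract.
ALPHA = 0.05

# Counts reported in the abstract.
N_HIV_VARIANTS = 3010
N_HUMAN_SNPS = 388501


def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Return the per-test significance threshold for n_tests comparisons."""
    if n_tests <= 0:
        raise ValueError("n_tests must be positive")
    return alpha / n_tests


# One test per HIV variant against viral load.
vl_threshold = bonferroni_threshold(ALPHA, N_HIV_VARIANTS)

# One test per (HIV variant, human SNP) pair.
snp_threshold = bonferroni_threshold(ALPHA, N_HIV_VARIANTS * N_HUMAN_SNPS)

print(f"HIV variant vs. VL threshold:  {vl_threshold:.2e}")
print(f"HIV variant vs. SNP threshold: {snp_threshold:.2e}")
```

Under the assumed error rate, both values round to the thresholds reported for the viral load and SNP analyses. The HLA threshold is not reproduced here because the abstract does not give the number of imputed HLA alleles tested.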