
Fast decliner phenotype of chronic obstructive pulmonary disease (COPD): applying machine learning for predicting lung function loss

BMJ Group
BMJ Open Respiratory Research

NikolaouV, etal. BMJ Open Resp Res 2021;8:e000980. doi:10.1136/bmjresp-2021-000980
To cite: NikolaouV,
MassaroS, GarnW, etal.
Fast decliner phenotype
of chronic obstructive
pulmonary disease (COPD):
applying machine learning
for predicting lung function
loss. BMJ Open Resp Res
2021;8:e000980. doi:10.1136/
bmjresp-2021-000980
Received 6 May 2021
Accepted 19 October 2021
1 University of Surrey, Surrey Business School, Guildford, UK
2 The Organizational Neuroscience Laboratory, London, UK
3 Hague University of Applied Sciences, Den Haag, The Netherlands
4 Academic Primary Care, University of Aberdeen, Aberdeen, UK
5 Optimum Patient Care, Cambridge, UK
6 Observational and Pragmatic Research Institute, Singapore
Correspondence to Mr Vasilis Nikolaou; v.nikolaou@surrey.ac.uk
Fast decliner phenotype of chronic obstructive pulmonary disease (COPD): applying machine learning for predicting lung function loss
Vasilis Nikolaou,1 Sebastiano Massaro,1,2 Wolfgang Garn,1 Masoud Fakhimi,1 Lampros Stergioulas,3 David B Price4,5,6
Chronic obstructive pulmonary disease
© Author(s) (or their employer(s)) 2021. Re-use permitted under CC BY-NC. No commercial re-use. See rights and permissions. Published by BMJ.
ABSTRACT
Background Chronic obstructive pulmonary disease (COPD) is a heterogeneous group of lung conditions that are challenging to diagnose and treat. Identification of phenotypes of patients with lung function loss may allow early intervention and improve disease management. We characterised patients with the ‘fast decliner’ phenotype, determined its reproducibility and predicted lung function decline after COPD diagnosis.
Methods A prospective 4-year observational study that applies machine learning tools to identify COPD phenotypes among 13 260 patients from the UK Royal College of General Practitioners Research and Surveillance Centre database. The phenotypes were identified prior to diagnosis (training data set), and their reproducibility was assessed after COPD diagnosis (validation data set).
Results Three COPD phenotypes were identified, the most common of which was the ‘fast decliner’, characterised by patients of younger age with the lowest number of COPD exacerbations and better lung function, yet a fast decline in lung function with an increasing number of exacerbations. The other two phenotypes were characterised by (a) patients with the highest prevalence of COPD severity and (b) patients of older age, mostly men, with the highest prevalence of diabetes, cardiovascular comorbidities and hypertension. These phenotypes were reproduced in the validation data set with 80% accuracy. Gender, COPD severity and exacerbations were the most important risk factors for lung function decline in the most common phenotype.
Conclusions In this study, three COPD phenotypes were identified prior to patients being diagnosed with COPD. The reproducibility of those phenotypes in a blind data set following COPD diagnosis suggests their generalisability among different populations.
INTRODUCTION
Chronic obstructive pulmonary disease (COPD) is a widespread group of lung diseases such as asthma, emphysema and chronic bronchitis that causes breathing difficulties as a result of fast lung function decline.1 Several studies2–5 have shown that common risk factors associated with lung function decline in patients with COPD are smoking,4 emphysema4 and severity of emphysema,3 as well as COPD exacerbations2 5 along with elevated blood eosinophil counts.2 Kerkhof et al2 showed that patients with mild-to-moderate COPD with a high burden of exacerbations and elevated blood eosinophils have significant mitigation of their lung function decline when treated with inhaled corticosteroids (ICS). Although this finding suggests that early treatment may prevent further lung function loss, the full risk profile of those patients and their projected lung function loss remain largely unknown and underexplored in the present literature.
In this study, we aim to tackle these issues and provide a framework to improve the characterisation of patients with COPD and a fast decline in their lung function before diagnosis. In so doing, we develop several machine learning algorithms able to predict lung function decline after diagnosis. The implementation of this approach promises to provide medical practitioners with opportunities for early intervention and prevention of lung function loss.
Key messages
What are the characteristics of patients with chronic obstructive pulmonary disease (COPD) and a fast decline in their lung function, and can they be reproduced in different populations?
In 13 260 patients with COPD, the ‘fast decliner’ was the most common phenotype, characterised by younger patients with lung function loss with an increased number of COPD exacerbations.
The ‘fast decliner’ phenotype was reproduced in an unseen data set after COPD diagnosis. The most important risk factors for lung function decline were gender, COPD severity and exacerbations.
BMJ Open Resp Res: first published as 10.1136/bmjresp-2021-000980 on 29 October 2021. Downloaded from http://bmjopenrespres.bmj.com/ on October 29, 2021 by guest. Protected by copyright.
2NikolaouV, etal. BMJ Open Resp Res 2021;8:e000980. doi:10.1136/bmjresp-2021-000980
Open access
METHODS
Study design
This study is a retrospective analysis of an observational cohort spanning a 4-year period (2015–2018) among patients with COPD in the UK. Data were extracted from the Royal College of General Practitioners (RCGP) Research and Surveillance Centre (RSC) database,6 7 which includes more than 5 million patients, and in which over 2 million records and 500 million prescriptions (as of December 2017) are uploaded each week.8
Study population
Inclusion and exclusion criteria are shown in figure 1. The study included patients with a Read code9 for COPD diagnosis who were older than 35 years, were current or former smokers, had no active asthma, had a forced expiratory volume in 1 s (FEV1) to forced vital capacity (FVC) ratio (FEV1/FVC) below 0.7 (ie, the threshold for COPD diagnosis1) and had complete FEV1 records for four consecutive years. Specifically, we used FEV1 records in year 1 as a baseline, followed by at least 3 years of FEV1 recordings. We excluded patients younger than 35, non-smokers (as this group may be misdiagnosed with COPD and to align with National Institute for Health and Care Excellence (NICE) guidelines10), those with active asthma, those with an FEV1/FVC ratio of >0.7, as well as patients with less than 3 years of lung function (FEV1) values. Our inclusion and exclusion criteria yielded a total of 13 260 patients.
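The inclusion criteria can be read as a simple record filter. The following Python sketch is illustrative only: the authors worked in R, and the record fields used here (`has_copd_code`, `fev1_years` and so on) are hypothetical names, not the RCGP RSC schema. It assumes that an FEV1/FVC ratio below 0.7 qualifies and that four consecutive years of FEV1 records means a baseline year plus three years of follow-up.

```python
from dataclasses import dataclass


@dataclass
class Patient:
    # All field names are hypothetical; the database schema is not given in the text.
    age: int
    ever_smoker: bool      # current or former smoker
    active_asthma: bool
    fev1_fvc: float        # FEV1/FVC ratio
    fev1_years: int        # consecutive years with an FEV1 record
    has_copd_code: bool    # Read code for a COPD diagnosis


def is_eligible(p: Patient) -> bool:
    """Return True when a record meets every inclusion criterion listed in the text."""
    return (
        p.has_copd_code
        and p.age > 35               # "older than 35 years"
        and p.ever_smoker
        and not p.active_asthma
        and p.fev1_fvc < 0.7         # assumed reading of the diagnostic threshold
        and p.fev1_years >= 4        # baseline year plus at least 3 follow-up years
    )


cohort = [
    Patient(60, True, False, 0.65, 4, True),   # meets all criteria
    Patient(34, True, False, 0.60, 4, True),   # too young
    Patient(70, False, False, 0.60, 5, True),  # never smoked
    Patient(66, True, False, 0.75, 4, True),   # ratio above threshold
]
eligible = [p for p in cohort if is_eligible(p)]
```

Patients aged exactly 35, or with a ratio of exactly 0.7, fall between the stated inclusion and exclusion rules; the sketch excludes both, which is a choice made here rather than one stated in the text.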
Statistical analysis
To identify patients with underlying COPD phenotypes, we split the cohort into two groups: (a) the training data set, consisting of patients with COPD registered to a general practitioner (GP) practice before the COPD diagnosis; and (b) the validation data set, consisting of patients with COPD registered after their COPD diagnosis (figure 2). Thus, patients in both data sets share similar COPD-related characteristics. We divided our sample according to the COPD diagnosis date, rather than randomly, to allow our algorithms to learn patterns in the data prior to COPD diagnosis (training data set) and classify patients in an unbiased, data-driven way into clusters (phenotypes). We then used the clusters learnt in the training data set to predict new clusters for patients after COPD diagnosis (validation data set) and assessed their agreement as described below in the ‘Clusters validation after diagnosis’ section. Similarly, we trained three different regression algorithms to predict lung function decline in the training data set. We assessed their performance in the validation data set as described in the ‘Predictive models’ section.
Data reduction
The training data set was used to group patients of similar characteristics into distinct clusters (ie, COPD phenotypes) using k-means cluster analysis (ie, a method that splits the data into mutually exclusive groups). To apply k-means clustering, we first transformed 19 clinically relevant variables (sex, body mass index, smoking, COPD severity, COPD exacerbations, emphysema, diabetes, hypertension, coronary artery disease, acute myocardial infarction, congestive cardiac failure, anxiety, depression and six types of treatment) into uncorrelated ones on the same scale. In other words, prior to cluster analysis, we reduced the dimensionality of the data from the 19 selected variables to three uncorrelated components that explained the most variability of the data by using multiple correspondence analysis (MCA), the equivalent of principal components analysis for categorical data.11 Prior to this, we also imputed the missing values for the categorical variables of body mass index and COPD severity by using multivariate imputation by chained equations.12
Figure 1 Flow chart of study cohort. COPD, chronic obstructive pulmonary disease; FEV1, forced expiratory volume in 1 s; FVC, forced vital capacity.
Figure 2 Main steps in phenotype identification before and after COPD diagnosis. COPD, chronic obstructive pulmonary disease; MCA, multiple correspondence analysis; RF, random forest.
Clustering
Because k-means requires the number of clusters to be specified in advance, we began our clustering procedure by performing a hierarchical cluster analysis,13 which does not require a predetermined number of clusters. We used the derived dendrogram to visually assess the optimal number of clusters (figure 3). This selection process involves following the branch of the tree with the largest height (distance from top to bottom) and drawing a horizontal line (dashed line) across the other branches. The number of times the horizontal line intersects the branches determines the optimal number of clusters.
To confirm the number of clusters determined by visual inspection of the dendrogram, we applied two further statistical methods, namely the elbow14 and silhouette15 methods. The elbow method measures how close subjects are within the same cluster by minimising heterogeneity (or maximising homogeneity): a lower within-cluster variation indicates good compactness. The silhouette method measures how close a subject in one cluster is to subjects in neighbouring clusters by using the average silhouette width to measure the distance between clusters. Here, the bigger the average silhouette width, the larger the distance between the clusters.
Next, we applied the k-means algorithm (figure 4) using different numbers of clusters (eg, from 1 to 10 clusters). The point beyond which a further reduction in the within-cluster sum of squares (or increase in the average silhouette width) no longer changes the robustness (or separation) of the clusters allowed us to determine the optimal number of clusters.
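The elbow and silhouette criteria described above can be illustrated with a small, self-contained Python sketch. This is not the authors' R code: the toy two-dimensional points, the function names and the farthest-first choice of starting centroids are all assumptions made here so that the example is deterministic.

```python
import math
from collections import Counter


def dist(a, b):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(points, k, iters=100):
    """Lloyd's algorithm with farthest-first starts (chosen here for
    determinism; this is an assumption, not the authors' setting)."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(max(points, key=lambda p: min(dist(p, c) for c in centroids)))
    labels = None
    for _ in range(iters):
        new_labels = [min(range(k), key=lambda c: dist(p, centroids[c])) for p in points]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an emptied cluster keeps its previous centroid
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centroids


def within_ss(points, labels, centroids):
    """Total within-cluster sum of squares (the elbow criterion)."""
    return sum(dist(p, centroids[lab]) ** 2 for p, lab in zip(points, labels))


def average_silhouette_width(points, labels):
    """Mean silhouette width; requires at least two clusters."""
    sizes = Counter(labels)
    widths = []
    for i, p in enumerate(points):
        if sizes[labels[i]] == 1:  # singleton clusters score 0 by convention
            widths.append(0.0)
            continue
        mean_to = {}
        for c in sizes:
            ds = [dist(p, q) for j, q in enumerate(points) if labels[j] == c and j != i]
            mean_to[c] = sum(ds) / len(ds)
        a = mean_to[labels[i]]                                    # cohesion
        b = min(v for c, v in mean_to.items() if c != labels[i])  # separation
        widths.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(widths) / len(widths)


# Toy data: three well-separated groups of four points each.
points = [(cx + dx, cy + dy)
          for cx, cy in [(0, 0), (10, 0), (0, 10)]
          for dx, dy in [(0, 0), (1, 0), (0, 1), (1, 1)]]

for k in range(2, 5):
    labels, centroids = kmeans(points, k)
    print(k, round(within_ss(points, labels, centroids), 2),
          round(average_silhouette_width(points, labels), 2))
```

On data like this, the within-cluster sum of squares falls sharply up to the true number of groups and only marginally afterwards, while the average silhouette width peaks at that number; those are the two patterns the elbow and silhouette methods look for.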
Intriguingly, the silhouette plots can also be used to determine the robustness of the clusters derived by using either the hierarchical or the k-means clustering method.16 In our sample, these outputs indicate that k-means is preferable to hierarchical clustering (figure 5), for two main reasons. First, the average silhouette width under the k-means algorithm (figure 5; bottom plot) was bigger than the one under the hierarchical algorithm (figure 5; top plot). Second, there were more subjects with negative silhouette widths under the hierarchical algorithm than under k-means clustering, especially for clusters 1 and 3, suggesting that k-means offers more stable clusters than hierarchical clustering.
Predictive models
We trained three regressors (decision tree, gradient boosting machine, linear regression) to predict lung function in the validation data set. We used FEV1 as the dependent variable, and the 19 variables used in the MCA step, together with age, as predictors. Moreover, with the R library ‘caretEnsemble’,17 we constructed two ‘ensemble models’: a linear and a random forest (RF) ensemble of the above-mentioned regressors.
Figure 3 Inspecting the number of clusters using hierarchical analysis in the training data set.
Figure 4 Determining the optimal number of clusters in the training data set.
Figure 5 Silhouette plots to determine the optimal clustering method: hierarchical (top) and k-means (bottom).
All algorithms were first trained and tested on 70% and 30% of the training data set (ie, RF train and RF test; figure 2), respectively, to fine-tune their parameters using automated tuning with the R library ‘caret’.17 They were then re-trained on the full training data set and tested on the blind validation data set to assess their final performance. This was done by calculating the root mean squared error (RMSE) and mean absolute error (MAE). The RMSE is the square root of the mean of the squared differences between observed and predicted values (ie, the prediction errors or residuals); it shows how far the predictions fall from the observed values and is calculated as:
$$\mathrm{RMSE} = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \hat{x}_i)^2}{N}}$$
where $x_i$ and $\hat{x}_i$ are the observed and predicted values, respectively.
The MAE is instead the mean absolute difference between observed and predicted values; it shows the magnitude of the prediction errors and is calculated as:
$$\mathrm{MAE} = \frac{\sum_{i=1}^{n} |y_i - x_i|}{n}$$
where $y_i$ and $x_i$ are the predicted and observed values, respectively.
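As a worked illustration of the two error measures, the following Python sketch computes both on a handful of made-up FEV1 values. The numbers and function names are invented for the example; they are not study data, and the authors computed these measures in R.

```python
import math


def rmse(observed, predicted):
    """Root mean squared error: square root of the mean squared residual."""
    n = len(observed)
    return math.sqrt(sum((x - x_hat) ** 2 for x, x_hat in zip(observed, predicted)) / n)


def mae(observed, predicted):
    """Mean absolute error: mean of the absolute residuals."""
    n = len(observed)
    return sum(abs(y - x) for x, y in zip(observed, predicted)) / n


# Hypothetical FEV1 values in litres (illustrative only).
observed = [0.70, 0.80, 0.60, 0.90]
predicted = [0.75, 0.70, 0.60, 1.00]

print(round(rmse(observed, predicted), 4))
print(round(mae(observed, predicted), 4))
```

Because the RMSE squares each residual before averaging, it is never smaller than the MAE and penalises large errors more heavily, which is one reason for reporting both measures together.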
Finally, we calculated the effect of the most important
predictors on lung function decline by means of the best
performing model in the validation data set (ie, the one
with the lowest RMSE and MAE values).
All statistical analyses were implemented with the statistical software R.18
Patient and public involvement
Patients were not involved in the design, conduct,
reporting or dissemination plans of this study.
RESULTS
Patient characteristics
Table 1 summarises the descriptive characteristics of
patients registered with a GP before and after their COPD
diagnosis at the baseline (first year of the study period).
When looking at the association between FEV1 and the
number of COPD exacerbations before and after COPD
diagnosis (figure 6), the decline in lung function with
an increased number of exacerbations appears to be
faster in the period prior to COPD diagnosis than after
diagnosis. Thus, we examined whether a similar pattern
existed among the phenotypes we derived, as well as the
extent of such a decline.
Prior to COPD diagnosis
Table 2 presents the baseline characteristics of the three clusters of patients identified for the pre-COPD diagnosis period.
Phenotype A was characterised by a higher proportion (one-third) of severe/very severe COPD (with severity being defined by the physician) and a higher number of COPD exacerbations; almost half of these patients had hypertension, and one-third were depressed. Almost all patients with this phenotype were treated with ICS and a combination of ICS and LABA (long-acting beta agonist) treatment, while a considerable proportion was treated with LAMA (long-acting antimuscarinic) and mucolytics. Phenotype B was characterised by patients of an older age, a larger male majority, as well as a higher proportion of overweight patients, a high prevalence of diabetes and cardiovascular comorbidities (hypertension, coronary artery disease, acute myocardial infarction, congestive cardiac failure) and depression, but the majority of them had moderate COPD severity. Almost half of the patients in this phenotype were treated with ICS and LAMA and one-third of them with an ICS and LABA combination. Phenotype C was characterised by patients of a younger age, more than one-third of whom were overweight, but almost half of them had moderate COPD severity. Patients in this phenotype had the lowest number of COPD exacerbations and better lung (FEV1) function, yet almost half of them had hypertension and one-third of them had depression. The most frequent treatments in those patients were LAMA and mucolytics. The most noticeable patient characteristics for each of the three derived phenotypes are summarised in table 3.
When observing the association between lung function and number of COPD exacerbations (figure 7), the fastest decline in FEV1 was observed in patients of phenotype C. Those patients were also younger, suggesting that phenotype C resembles the clinical features of the fast decliner phenotype.2–4
Clusters validation after diagnosis
To validate the cluster assignments derived prior to COPD diagnosis, we developed an RF model that used the three clusters (derived by k-means clustering) as the dependent variable and the 19 categorical variables (sex, body mass index, smoking, COPD severity, COPD exacerbations, emphysema, diabetes, hypertension, coronary artery disease, acute myocardial infarction, congestive cardiac failure, anxiety, depression and six types of treatment) and age as independent variables.
The RF model was trained on a random sample comprising 70% of the training data set (RF train data set; n=8037; figure 2) and tested on the remaining 30% of the data (RF test data set; n=3445; figure 2) for internal validation.
To improve the RF model’s performance, we used a 10-fold cross-validation method. This method involves splitting the data into 10 folds (samples): nine of them are used for training and the remaining one for testing. The held-out fold is then rotated, so that a different set of nine folds is used for training and one for testing, until each of the 10 folds has been used for testing once. We further optimised the model’s performance by applying parameter tuning.19 This led to an (internal) accuracy of 99% for predicting the same clusters in the RF test data set.
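The fold rotation in 10-fold cross-validation can be sketched in a few lines of Python. This is a generic illustration rather than the authors' implementation, which relied on R; the function name `k_fold_indices` and the fixed random seed are assumptions introduced here.

```python
import random


def k_fold_indices(n, k=10, seed=0):
    """Yield (train, test) index lists so that each fold is held out exactly once."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)       # shuffle once, reproducibly
    folds = [idx[i::k] for i in range(k)]  # k folds of near-equal size
    for held_out in range(k):
        test = folds[held_out]
        train = [j for f, fold in enumerate(folds) if f != held_out for j in fold]
        yield train, test


# Example with the size of the RF train data set reported in the text (n=8037).
for train, test in k_fold_indices(8037, k=10):
    assert len(train) + len(test) == 8037
```

Each observation appears in exactly one test fold and in the training portion of the other nine splits, so every performance estimate comes from data the model did not see during that round of fitting.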
Table 1 Baseline (year 1) demographic and clinical characteristics of patients before and after COPD diagnosis

| Variables | Prior to COPD diagnosis (n=11 482) | After COPD diagnosis (n=1778) | Total (n=13 260) |
|---|---|---|---|
| Age, mean (SD), years | 69 (10) | 70 (9) | 69 (10) |
| Sex, male, no. (%) | 6526 (57) | 1029 (58) | 7555 (57) |
| Body mass index, mean (SD), kg/m² | 27 (6) | 27 (6) | 27 (6) |
| Body mass index, no. (%) with data | 11 409 (99) | 1759 (99) | 13 168 (99) |
| Underweight | 403 (3) | 85 (5) | 488 (4) |
| Normal weight | 4066 (36) | 643 (37) | 4709 (36) |
| Overweight | 4070 (36) | 588 (33) | 4658 (35) |
| Obese | 2870 (25) | 443 (25) | 3313 (25) |
| Smoking status, no. (%) | | | |
| Active smoker | 4467 (39) | 648 (36) | 5115 (39) |
| Former smoker | 7015 (61) | 1130 (64) | 8145 (61) |
| COPD severity, no. (%) with data | 5859 (51) | 925 (52) | 6784 (51) |
| Mild | 1957 (33) | 293 (32) | 2250 (33) |
| Moderate | 2831 (48) | 433 (47) | 3264 (48) |
| Severe | 975 (17) | 177 (19) | 1152 (17) |
| Very severe | 96 (2) | 22 (2) | 118 (2) |
| COPD exacerbations in the past year, mean (SD) | 0.3 (0.9) | 0.5 (1.3) | 0.3 (1.0) |
| COPD exacerbations in the past year, no. (%) | | | |
| 0 | 9736 (85) | 1395 (79) | 11 131 (84) |
| 1 | 998 (8) | 195 (11) | 1193 (9) |
| 2 | 433 (4) | 79 (4) | 512 (4) |
| >2 | 315 (3) | 109 (6) | 424 (3) |
| Forced expiratory volume in 1 s, mean (SD), L | 0.7 (0.2) | 0.7 (0.2) | 0.7 (0.2) |
| Emphysema, no. (%) | 646 (6) | 248 (14) | 894 (7) |
| Diabetes, no. (%) | 1771 (15) | 280 (16) | 2051 (16) |
| Hypertension, no. (%) | 5317 (46) | 823 (46) | 6140 (46) |
| Coronary artery disease, no. (%) | 675 (6) | 106 (6) | 781 (6) |
| Acute myocardial infarction, no. (%) | 822 (7) | 144 (8) | 966 (7) |
| Congestive cardiac failure, no. (%) | 719 (6) | 110 (6) | 829 (6) |
| Anxiety, no. (%) | 938 (8) | 142 (8) | 1080 (8) |
| Depression, no. (%) | 3490 (30) | 582 (33) | 4072 (31) |
| Treatment, no. (%) | | | |
| ICS | 5082 (44) | 1056 (59) | 6138 (46) |
| ICS+LABA | 4486 (39) | 969 (55) | 5455 (41) |
| LAMA | 5363 (47) | 985 (55) | 6348 (48) |
| LABA | 1101 (10) | 147 (8) | 1248 (9) |
| SAMA | 581 (5) | 100 (6) | 681 (5) |
| Mucolytics | 1028 (9) | 231 (13) | 1259 (10) |

COPD, chronic obstructive pulmonary disease; ICS, inhaled corticosteroids; LABA, long-acting beta agonist; LAMA, long-acting antimuscarinic; SAMA, short-acting antimuscarinic.

The very same model was then trained on the full training data set (both RF train and RF test) and used to predict cluster assignments in the blind validation data set (figure 2). The predicted clusters were compared with those of the validation data set, derived with the same approach described above for the training data set (that is, data reduction and k-means clustering), using the Adjusted Rand Index20 and Jaccard Index21 for external clustering validation (ie, measuring the extent of agreement between clusters derived by two different methods). Both indices showed an agreement of 80% between the clusters predicted by the RF model and the clusters derived using k-means clustering in the validation data set.
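Both agreement indices can be computed by counting pairs of patients that two partitions place together. The Python sketch below uses the standard pair-counting definitions; whether the R routines used by the authors follow exactly these definitions is an assumption, and the function names and toy label vectors are invented for illustration.

```python
from collections import Counter
from math import comb


def pair_counts(a, b):
    """Count pairs grouped together in both partitions, in a, in b, and overall."""
    both = sum(comb(c, 2) for c in Counter(zip(a, b)).values())
    same_a = sum(comb(c, 2) for c in Counter(a).values())
    same_b = sum(comb(c, 2) for c in Counter(b).values())
    return both, same_a, same_b, comb(len(a), 2)


def adjusted_rand_index(a, b):
    """Rand index corrected for the agreement expected by chance."""
    both, same_a, same_b, total = pair_counts(a, b)
    expected = same_a * same_b / total
    maximum = (same_a + same_b) / 2
    if maximum == expected:  # degenerate case, e.g. a single cluster in both
        return 1.0
    return (both - expected) / (maximum - expected)


def jaccard_index(a, b):
    """Pairs together in both partitions over pairs together in at least one.
    Undefined (division by zero) if every point is its own cluster in both."""
    both, same_a, same_b, _ = pair_counts(a, b)
    return both / (same_a + same_b - both)


# Toy labels: 'predicted' stands in for RF output, 'derived' for k-means output.
predicted = [0, 0, 0, 1, 1, 1]
derived = [0, 0, 1, 1, 2, 2]
print(round(adjusted_rand_index(predicted, derived), 3),
      round(jaccard_index(predicted, derived), 3))
```

Both indices depend only on which patients are grouped together, not on the cluster labels themselves, so a model that recovers the same groups under different cluster numbers still scores 1.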
Predicted lung function loss after diagnosis
Given the prevalence of phenotype C in the sample and the limited literature on the fast decliner phenotype at present, we predicted the lung function of patients with this phenotype by training three regressors (decision tree, gradient boosting machine and linear regression) and a linear ensemble of those regressors in the data set prior to COPD diagnosis (figure 8).
As shown in figure 8, the gradient boosting machine (gbm) performed better than the linear regression (generalised linear model (glm)) and the decision tree (rpart), that is, it had the lowest RMSE value. Moreover, the performance of those models combined through a linear model (linear ensemble; red dashed line) was as good as that of the gbm model.
We then combined those three models (gbm, glm, rpart) under an RF ensemble and assessed the performance of all models after COPD diagnosis (table 4). The linear regression model performed as well as the gbm and the linear ensemble. In contrast, the rpart and the RF ensemble performed worst in the validation data set.
Additionally, we used a more conventional linear regression to calculate the effect of the most important predictors for lung function decline (table 5) in the training data set.
As shown in table 5, the predictors together explain 95% of the model’s variance. The most important predictor, which explained 36% of the variance, was sex, which was associated with a decline in lung function of 0.066 L (or 66 mL) for male compared with female patients. The second most important predictor was COPD severity, which explained 18% of the variance: patients with moderate and severe COPD had a statistically significant lung function decline of 35 mL and 64 mL, respectively, compared with those with mild COPD. LAMA treatment was also associated with a 37 mL decline in lung function, as was an increased number of COPD exacerbations, with declines of 38 mL, 51 mL and 78 mL for one, two and more than two exacerbations, respectively.
The least important predictors were smoking, diabetes and LABA treatment, which explained from 2% to 4% of the variance and predicted a lung function decline of 16 mL, 24 mL and 28 mL, respectively. Age, however, was associated with a statistically significant increase of 1.7 mL per year, which is not a surprising finding per se given that the fast decliner phenotype was characterised by better lung function in patients.
DISCUSSION
This study aimed to better characterise patients with COPD, in particular patients with the ‘fast decliner’ phenotype, by means of statistical and machine learning tools. Statistical methods, such as MCA11 and cluster analysis,13–15 are traditionally used in COPD research and beyond to reduce the dimensionality of the data into a few uncorrelated variables that explain most of the variability and to group subjects of similar characteristics into homogeneous and distinct clusters. These methods use all patients’ information by integrating demographics along with clinical and treatment characteristics. Owing to COPD heterogeneity, this integration allows for better identification and characterisation of COPD phenotypes that extends beyond the typical clinical approach (ie, following the Global Initiative for Chronic Obstructive Lung Disease recommendations).22 Moreover, machine learning provides researchers and practitioners with rpart, RF and gbm models23 24 that can accommodate non-linear relationships.
Here, we applied these tools to go beyond the traditional analysis of demographic and clinical characteristics of patients with COPD to predict lung function decline after their COPD diagnosis. The strengths of our approach are (a) the use of a prospective longitudinal public data set, (b) a large sample size of 13 260 patients, (c) multiple imputation for handling missing values and (d) the choice of variables to be included in cluster analysis, as well as the number of clusters, by combining data-driven methods with knowledge from the existing literature and clinical expertise.
The use of a large sample size allowed us to identify three distinct clusters (ie, phenotypes) of patients with different demographic, clinical and treatment characteristics prior to COPD diagnosis, which were able to predict similar clusters of patients’ profiles post-diagnosis with an 80% agreement. This encouraging finding suggests that such phenotypes can be reproduced across different data sets and populations. Another advantage of using a large sample is the ability to randomly split the training data set (ie, prior to COPD diagnosis) into RF train and RF test subsets, train the RF model on the RF train data set and validate its predictions on the RF test data set, a process called internal validation. We further validated the phenotypes on the post-COPD diagnosis data set to predict cluster assignment and compare these with those derived by the k-means method.
Figure 6 Association between lung function and number of COPD exacerbations before and after COPD diagnosis. COPD, chronic obstructive pulmonary disease; FEV1, forced expiratory volume in 1 s.
Moreover, the 10-fold cross-validation used when training our models (ie, rpart, RF, gbm, linear regression and ensembles), along with the tuning of the models’ parameters, improves performance and helps avoid overfitting, a phenomenon observed when the same model is used for both training and prediction without being tested (prior to prediction) on an unseen data set (whose observations did not contribute to its training).
Table 2 Baseline (year 1) phenotype characteristics prior to COPD diagnosis

| Variables | Phenotype A (n=4339) | Phenotype B (n=1040) | Phenotype C (n=6103) |
|---|---|---|---|
| Age, mean (SD), years | 69 (9) | 73 (8) | 68 (10) |
| Sex, male, no. (%) | 2456 (57) | 799 (77) | 3271 (54) |
| Body mass index, mean (SD), kg/m² | 27 (6) | 29 (5) | 27 (5) |
| Body mass index, no. (%) with data | 4311 (99) | 1040 (100) | 6058 (99) |
| Underweight | 1618 (38) | 220 (21) | 2228 (37) |
| Normal weight | 1029 (24) | 381 (37) | 1460 (24) |
| Overweight | 1479 (34) | 427 (41) | 2164 (36) |
| Obese | 185 (4) | 12 (1) | 206 (3) |
| Smoking status, no. (%) | | | |
| Active smoker | 1542 (36) | 306 (29) | 2619 (43) |
| Former smoker | 2797 (64) | 734 (71) | 3484 (57) |
| COPD severity, no. (%) with data | 2481 (57) | 556 (54) | 2822 (46) |
| Mild | 587 (24) | 174 (31) | 1196 (42) |
| Moderate | 1154 (46) | 316 (57) | 1361 (48) |
| Severe | 666 (27) | 62 (11) | 247 (9) |
| Very severe | 74 (3) | 4 (1) | 18 (1) |
| COPD exacerbations in the past year, mean (SD) | 0.5 (1.2) | 0.2 (0.7) | 0.1 (0.8) |
| COPD exacerbations in the past year, no. (%) | | | |
| 0 | 3323 (77) | 899 (86) | 5514 (90) |
| 1 | 497 (11) | 85 (8) | 416 (7) |
| 2 | 266 (6) | 36 (4) | 131 (2) |
| >2 | 253 (6) | 20 (2) | 42 (1) |
| Forced expiratory volume in 1 s, mean (SD), L | 0.7 (0.2) | 0.7 (0.2) | 0.8 (0.2) |
| Emphysema, no. (%) | 308 (7) | 59 (6) | 279 (5) |
| Diabetes, no. (%) | 597 (14) | 382 (37) | 792 (13) |
| Hypertension, no. (%) | 1948 (45) | 703 (68) | 2666 (44) |
| Coronary artery disease, no. (%) | 33 (1) | 617 (59) | 25 (0.4) |
| Acute myocardial infarction, no. (%) | 75 (2) | 681 (66) | 66 (1) |
| Congestive cardiac failure, no. (%) | 223 (5) | 304 (29) | 192 (3) |
| Anxiety, no. (%) | 319 (7) | 101 (10) | 518 (9) |
| Depression, no. (%) | 1279 (30) | 348 (34) | 1863 (31) |
| Treatment, no. (%) | | | |
| ICS | 4290 (99) | 408 (39) | 384 (6) |
| ICS+LABA | 4141 (95) | 339 (33) | 6 (0.1) |
| LAMA | 3022 (70) | 437 (42) | 1904 (31) |
| LABA | 227 (5) | 92 (9) | 780 (12.8) |
| SAMA | 206 (5) | 64 (6) | 311 (5) |
| Mucolytics | 756 (17) | 108 (10) | 164 (23) |

COPD, chronic obstructive pulmonary disease; ICS, inhaled corticosteroids; LABA, long-acting beta agonist; LAMA, long-acting antimuscarinic; SAMA, short-acting antimuscarinic.
Finally, we ensembled the individual models (ie, rpart, RF, gbm, linear regression) by using either a linear or an RF regressor to boost their performance. We then used the model with the best performance (ie, the linear regression) to identify the most important risk factors for lung function decline in patients with the fast decliner phenotype. Two of those predictors, COPD severity and COPD exacerbations, projected a decline in lung function of more than 30 mL, which is consistent with the findings of similar studies.2 25 Specifically, Kerkhof et al2 used multilevel mixed-effects linear regression models to determine the association between annual exacerbation rate following initiation of ICS therapy and FEV1 decline. The authors also carried out a longitudinal study of a similar sample size to ours (n=12 178 patients with mild-to-moderate COPD) and found a decline in lung function of 19 mL/year for each exacerbation for patients with blood eosinophil counts equal to or greater than 350 cells/µL not on ICS and a reduced lung function loss that ranged from 4 mL/year to 15 mL/year for those treated with ICS. In their effort to explore the heterogeneity of COPD progression, Papi et al25 reported the variability in lung function decline from the Evaluation of COPD Longitudinally to Identify Predictive Surrogate End-point (ECLIPSE) cohort.4 In this 3-year prospective study, 38% and 31% of patients had a lung function decline of more than 40 mL/year and from 21 mL/year to 40 mL/year, respectively; 23% had a 20 mL/year decrease to 20 mL/year increase in their lung function, while just 8% had more than 20 mL/year lung function increase. In our sample of patients with the fast decliner phenotype, we observed a decrease of more than 40 mL in lung function in men (54%), those with severe COPD (9%) and those with two or more COPD
Table 3 Phenotypes’ characteristics prior to COPD diagnosis
Phenotype A: highest prevalence of severe COPD; highest number of COPD exacerbations in the past year; hypertension (almost half); depression (one-third); most-treated overall; ICS (nearly all); ICS+LABA (nearly all); LAMA (large majority); mucolytics.
Phenotype B: older age; larger majority of males; overweight (almost half); highest prevalence of diabetes; highest prevalence of cardiovascular comorbidities; hypertension (two-thirds); coronary artery disease (more than half); acute myocardial infarction (more than half); congestive cardiac failure (one-third); depression (one-third); intermediate level of treatment; ICS (almost half); ICS+LABA (one-third); LAMA (almost half).
Phenotype C: younger age; overweight (one-third); lowest number of COPD exacerbations in the past year; better lung function; hypertension (almost half); depression (one-third); least-treated overall; LAMA (one-third); mucolytics.
COPD, chronic obstructive pulmonary disease; ICS, inhaled corticosteroids; LABA, long-acting beta agonist; LAMA, long-acting
antimuscarinic.
Figure 7 Association between lung function and number
of exacerbations by phenotype—prior to COPD diagnosis.
COPD, chronic obstructive pulmonary disease; FEV1,
forced expiratory volume in 1 s.
Figure 8 Models’ performance on training data set.
The red dashed line shows the performance of the linear
ensemble. RMSE, root mean squared error.
exacerbations (3%); a decrease of between 21 mL and 40
mL was also observed in patients with moderate or very
severe COPD (49%), as well as in those with one COPD
exacerbation (7%), those with diabetes (13%) and those
on LAMA (31%) or LABA (13%) treatment. We also
observed a decrease in lung function of 16 mL in active
smokers (43%). Furthermore, in their 5-year prospective
study, Nishimura et al3 classified patients with COPD into
three phenotypes based on lung function loss: fast
decliners, with a decline in lung function of 63±2
mL/year; slow decliners, with a decline of 31±1 mL/year;
and sustainers, with a decline of 2±1 mL/year. The
severity of emphysema was found to be independently
associated with a rapid decline in lung function.
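The banding of risk factors described in this paragraph can be reproduced from the estimates in Table 5. The sketch below is illustrative only: it assumes the Table 5 estimates are expressed in litres (an assumption suggested by the match between the −0.016 estimate and the 16 mL quoted for active smokers), and the names `ESTIMATES_L`, `decline_ml` and `band` are hypothetical, not part of the study's code. Age is omitted because its coefficient is per year of age rather than per category.

```python
# Coefficients copied from Table 5; the unit (litres) is an assumption.
ESTIMATES_L = {
    "Sex, male": -0.066,
    "COPD severity: moderate": -0.035,
    "COPD severity: severe": -0.064,
    "COPD severity: very severe": -0.031,
    "LAMA, yes": -0.037,
    "Exacerbations: 1": -0.038,
    "Exacerbations: 2": -0.051,
    "Exacerbations: >2": -0.078,
    "LABA, yes": -0.028,
    "Active smoker": -0.016,
    "Diabetes, yes": -0.024,
}


def decline_ml(estimate_litres: float) -> int:
    """Turn a negative coefficient in litres into a positive decline in mL."""
    return round(-estimate_litres * 1000)


def band(decline: int) -> str:
    """Assign a decline in mL to one of the bands used in the text."""
    if decline > 40:
        return ">40 mL"
    if decline >= 21:
        return "21-40 mL"
    return "<=20 mL"


# Map every risk factor to its band and print a small summary.
bands = {name: band(decline_ml(est)) for name, est in ESTIMATES_L.items()}
for name, label in bands.items():
    print(f"{name:28s} {decline_ml(ESTIMATES_L[name]):3d} mL  {label}")
```

Under these assumptions, male sex, severe COPD and two or more exacerbations fall in the highest band, which matches the grouping given in the text.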
Limitations
There are several limitations in our study, which also
represent important calls for future research. One
limitation relates to the quality of the available data, given
that the data were collected from different GP practices
without standardised measurement processes. As such, the
accuracy of the respiratory values reported (eg, FEV1) may
vary across practices. Moreover, including only patients
with at least 3 years of spirometry follow-up may improve
the reliability of their lung function measurements but
could bias the results, as patients with different follow-up
times could differ. Another limitation is the lack of
information on how the presence and/or the severity of
emphysema was captured in our database. While the
presence of emphysema in the RCGP RSC database is
recorded based on the clinician’s assessment,9 this is not
sufficient to capture its severity. Had a severity score of
emphysema been available, similar to the one calculated
by Nishimura et al3 using a visual and computerised
emphysema severity assessment, our algorithm would have
been more accurate in predicting the change in lung
function attributed to this risk factor. A third limitation is
the lack of biomarkers in the RCGP RSC database, such as
the eosinophil count, which is a significant predictor of
lung function decline.2 Had biomarkers been used as
predictors, our regressors would have been more accurate
in predicting lung function in a blind data set. Our
sample also lacks detailed treatment information such as
dosage and frequency of treatment intake. Had such
information been added to our model, a GP could infer
by what amount a treatment can be adjusted or how
frequently it should be taken to mitigate lung function
loss. We believe, however, that these are all important
calls for future research that could potentially be tackled
by applying our models for prediction on
Table 4 Models’ performance metrics on the validation
data set
RMSE MAE
Decision tree 0.183 0.149
Gradient boosting machine 0.181 0.147
Linear regression 0.181 0.147
Linear ensemble* 0.181 0.147
Random forest ensemble* 0.188 0.152
*Ensemble of three models: decision tree, gradient boosting machine
and linear regression.
MAE, mean absolute error; RMSE, root mean squared error.
Table 5 Risk factors for lung function decline prior to COPD diagnosis
Estimate 95% CI P value % variance
Sex, male* −0.066 −0.07 to −0.06 <0.001 36
COPD severity† 18
Moderate −0.035 −0.04 to −0.03 <0.001
Severe −0.064 −0.07 to −0.05 <0.001
Very severe −0.031 −0.06 to 0.002 0.075
LAMA, yes‡ −0.037 −0.04 to −0.03 <0.001 12
Age (years) 0.0017 0.001 to 0.002 <0.001 10
COPD exacerbations in the past year§ 9
1 −0.038 −0.05 to −0.03 <0.001
2 −0.051 −0.07 to −0.03 <0.001
>2 −0.078 −0.10 to −0.05 <0.001
LABA, yes‡ −0.028 −0.03 to −0.02 <0.001 4
Smoking¶ 4
Active smoker −0.016 −0.02 to −0.01 <0.001
Diabetes, yes‡ −0.024 −0.03 to −0.02 <0.001 2
*Reference group: Female.
†Reference group: Mild.
‡Reference group: No.
§Reference group: 0 exacerbations.
¶Reference group: Former smoker.
CI, Condence Interval; COPD, chronic obstructive pulmonary disease; LABA, long- acting beta agonist; LAMA, long- acting antimuscarinic.
other available COPD data sets, such as the Optimum
Patient Care Research Database (OPCRD),26 which also
contains a proper assessment of emphysema severity and
biomarker information.
Despite the above limitations, this work represents, to
the best of our knowledge, the first study among those
that have implemented machine learning to identify
clinically meaningful COPD phenotypes27 to fully
characterise patients with COPD with a fast decline in
their lung function and to predict lung function loss.
This was achieved using regressors ranging from
conventional linear regression to the more advanced
rpart, RF and gbm.
First, we used k-means clustering to identify three
COPD phenotypes prior to diagnosis. Next, using an RF
model, we showed that these phenotypes can be
reproduced in a different blind data set (after COPD
diagnosis), achieving a high level of agreement (80%)
between the predicted cluster assignments and those
derived by k-means clustering.
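The agreement figure quoted above can be read as the share of patients whose predicted cluster matches the cluster assigned by k-means. A minimal sketch of that calculation follows; the labels are made up for ten hypothetical patients (not study data), and the helper names `agreement` and `cross_table` are illustrative. It assumes both label lists use the same cluster names, which holds when the classifier is trained on the k-means labels.

```python
from collections import Counter


def agreement(reference, predicted):
    """Share of items whose predicted cluster equals the reference cluster."""
    if len(reference) != len(predicted):
        raise ValueError("label lists must have the same length")
    matches = sum(r == p for r, p in zip(reference, predicted))
    return matches / len(reference)


def cross_table(reference, predicted):
    """Count (reference, predicted) label pairs, i.e. a flat confusion matrix."""
    return Counter(zip(reference, predicted))


# Hypothetical cluster labels for ten patients (not study data).
kmeans_labels = ["A", "A", "B", "C", "C", "C", "B", "A", "C", "C"]
rf_labels = ["A", "A", "B", "C", "C", "A", "B", "C", "C", "C"]

score = agreement(kmeans_labels, rf_labels)
table = cross_table(kmeans_labels, rf_labels)
print(f"agreement = {score:.0%}")
for (ref, pred), count in sorted(table.items()):
    print(f"k-means {ref} -> predicted {pred}: {count}")
```

The cross table shows where the disagreements fall, which a single agreement percentage cannot reveal.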
Additionally, we trained three models (rpart, gbm and
glm) on the data set prior to COPD diagnosis and
validated them on the data set after diagnosis to predict
lung function loss. We further developed two ensemble
models, using either a linear or an RF model, to improve
performance in the blind validation data set. We found
that the most advanced machine learning models
performed as well as the linear regression model. We
therefore used the linear regression model to identify
several risk factors that predict lung function loss in
patients with the fast decliner phenotype. Similar models
can be developed for the other two phenotypes, which
are included in our future research agenda.
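The models in this comparison are ranked by root mean squared error (RMSE) and mean absolute error (MAE), as reported in Table 4. The sketch below shows how the two metrics are computed and how a best model could be picked by RMSE. All values are hypothetical FEV1 readings in litres, and the names `model_1` and `model_2` are placeholders rather than the study's models.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error: penalises large errors more heavily."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def mae(y_true, y_pred):
    """Mean absolute error: average size of the errors."""
    absolute = [abs(t - p) for t, p in zip(y_true, y_pred)]
    return sum(absolute) / len(absolute)


# Hypothetical observed and predicted FEV1 values in litres (not study data).
observed = [0.80, 0.70, 0.90, 0.60]
predictions = {
    "model_1": [0.70, 0.80, 0.80, 0.70],
    "model_2": [0.80, 0.70, 0.75, 0.60],
}

# Score every candidate, then keep the one with the lowest RMSE.
scores = {
    name: (rmse(observed, pred), mae(observed, pred))
    for name, pred in predictions.items()
}
best = min(scores, key=lambda name: scores[name][0])

for name, (r, m) in scores.items():
    print(f"{name}: RMSE={r:.3f} MAE={m:.3f}")
print("lowest RMSE:", best)
```

When several models score almost identically, as in Table 4, the simplest one is a reasonable choice because its coefficients can be interpreted directly.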
Moving forward, we anticipate that validating our
framework in non-UK populations may help to further
understand individual patient lung function profiles,
improve treatment decision-making in patients with
COPD with major lung function decline and prevent
lung function loss at an early stage.
Acknowledgements We acknowledge patients for allowing their data to be used
for surveillance and research; the practices that have agreed to be part of the RCGP
RSC and allow us to extract and use health data for surveillance and research;
Ms Filipa Ferreira from RCGP and Mr Julian Sherlock from the University of
Surrey; Apollo Medical Systems for data extraction; EMIS, TPP, In-Practice and
Micro-test CMR suppliers for facilitating data extraction; and colleagues
at Public Health England.
Contributors VN is responsible for conceptualisation, data curation, formal
analysis, investigation, methodology, validation, visualisation, writing of the
original draft, and reviewing and editing the final manuscript. SM is responsible
for reviewing, writing and editing the final manuscript. WG, MF and LS are
responsible for providing resources, software, supervision, validation and reviewing
the manuscript. DBP is responsible for conceptualisation and reviewing the
manuscript. All authors approved the final version of this manuscript and agree to
be accountable for all aspects of the work. VN acts as guarantor of the overall
content of the study.
Funding The authors have not declared a specific grant for this research from any
funding agency in the public, commercial or not-for-profit sectors.
Competing interests VN is an employee of Parexel. SM is the director of
the Organisational Neuroscience Laboratory. DBP declares advisory board
membership with Aerocrine, Amgen, AstraZeneca, Boehringer Ingelheim, Chiesi,
Mylan, Mundipharma, Napp Pharmaceuticals, Novartis and Teva; consultancy
agreements with Almirall, Amgen, AstraZeneca, Boehringer Ingelheim, Chiesi,
GlaxoSmithKline, Mylan, Mundipharma, Napp Pharmaceuticals, Novartis, Pfizer,
Teva and Theravance; grants and unrestricted funding for investigator-initiated
studies (conducted through Observational and Pragmatic Research Institute) from
Aerocrine, AKL Research and Development, AstraZeneca, Boehringer Ingelheim,
British Lung Foundation, Chiesi, Mylan, Mundipharma, Napp Pharmaceuticals,
Novartis, Pfizer, Respiratory Effectiveness Group, Teva, Theravance, UK National
Health Service and Zentiva; payment for lectures/speaking engagements from
Almirall, AstraZeneca, Boehringer Ingelheim, Chiesi, Cipla, GlaxoSmithKline, Kyorin,
Mylan, Merck, Mundipharma, Novartis, Pfizer, Skyepharma and Teva; payment for
manuscript preparation from Mundipharma and Teva; payment for the development
of educational materials from Mundipharma and Novartis; payment for travel/
accommodation/meeting expenses from Aerocrine, AstraZeneca, Boehringer
Ingelheim, Mundipharma, Napp Pharmaceuticals, Novartis and Teva; funding
for patient enrolment or completion of research from Chiesi, Novartis, Teva
and Zentiva; stock/stock options from AKL Research and Development, which
produces phytopharmaceuticals; owns 74% of the social enterprise Optimum
Patient Care (Australia and UK) and 74% of Observational and Pragmatic Research
Institute (Singapore); 5% shareholding in Timestamp, which develops adherence
monitoring technology; is peer reviewer for grant committees of the Efficacy and
Mechanism Evaluation programme and Health Technology Assessment; and was
an expert witness for GlaxoSmithKline.
Patient and public involvement Patients and/or the public were not involved in
the design, or conduct, or reporting, or dissemination plans of this research.
Patient consent for publication Not applicable.
Ethics approval University of Surrey’s Institutional Review Board (353003-352994-40371074).
Provenance and peer review Not commissioned; externally peer reviewed.
Data availability statement Data may be obtained from a third party and are not
publicly available.
Open access This is an open access article distributed in accordance with the
Creative Commons Attribution Non Commercial (CC BY-NC 4.0) license, which
permits others to distribute, remix, adapt, build upon this work non-commercially,
and license their derivative works on different terms, provided the original work is
properly cited, appropriate credit is given, any changes made indicated, and the
use is non-commercial. See: http://creativecommons.org/licenses/by-nc/4.0/.
REFERENCES
1 NHS inform on chronic obstructive pulmonary disease. Available: https://www.nhsinform.scot/illnesses-and-conditions/lungs-and-airways/copd/chronic-obstructive-pulmonary-disease#about-copd [Accessed 15 Feb 2020].
2 Kerkhof M, Voorham J, Dorinsky P, et al. Association between COPD exacerbations and lung function decline during maintenance therapy. Thorax 2020;75:744–53.
3 Nishimura M, Makita H, Nagai K, et al. Annual change in pulmonary function and clinical phenotype in chronic obstructive pulmonary disease. Am J Respir Crit Care Med 2012;185:44–52.
4 Vestbo J, Edwards LD, Scanlon PD, et al. Changes in forced expiratory volume in 1 second over time in COPD. N Engl J Med 2011;365:1184–92.
5 Kerkhof M, Voorham J, Dorinsky P, et al. The long-term burden of COPD exacerbations during maintenance therapy and lung function decline. Int J Chron Obstruct Pulmon Dis 2020;15:15.
6 Royal College of General Practitioners (RCGP) Research and Surveillance Centre (RSC). Available: http://www.rcgp.org.uk/rsc
7 de Lusignan S, Correa A, Smith GE, et al. RCGP research and surveillance centre: 50 years' surveillance of influenza, infections, and respiratory conditions. Br J Gen Pract 2017;67:440–1.
8 Correa A, Hinton W, McGovern A, et al. Royal College of General Practitioners Research and Surveillance Centre (RCGP RSC) sentinel network: a cohort profile. BMJ Open 2016;6:e011092.
9 Coded thesaurus of clinical terms. Available: https://digital.nhs.uk/services/terminology-and-classifications/read-codes [Accessed 01 Apr 2018].
10 NICE. Overview | chronic obstructive pulmonary disease in over 16s: diagnosis and management | guidance | NICE. Available: https://www.nice.org.uk/guidance/ng115 [Accessed 25 Feb 2019].
11 Mori Y, Kuroda M, Makino N. Nonlinear principal component analysis. In: Nonlinear principal component analysis and its applications. Singapore: Springer, 2016: 7–20.
12 van Buuren S, Groothuis-Oudshoorn K. mice: Multivariate Imputation by Chained Equations in R. J Stat Softw 2011;45:1–67.
13 Murtagh F, Legendre P. Ward’s hierarchical agglomerative clustering method: which algorithms implement Ward’s criterion? Journal of Classification 2014;31:274–95.
14 Bholowalia P, Kumar A. EBK-means: a clustering technique based on elbow method and k-means in WSN. International Journal of Computer Applications 2014;105:17–24.
15 Rousseeuw PJ. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J Comput Appl Math 1987;20:53–65.
16 Pikoula M, Quint JK, Nissen F, et al. Identifying clinically important COPD sub-types using data-driven approaches in primary care population based electronic health records. BMC Med Inform Decis Mak 2019;19:86.
17 Deane-Mayer ZA, Knowles JE. Ensembles of Caret Models. “Package caretEnsemble”, 2019. Available: https://github.com/zachmayer/caretEnsemble
18 R Core Team. R: a language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing, 2013. http://www.R-project.org/
19 Breiman L. Random forests. Mach Learn 2001;45:5–32.
20 Steinley D. Properties of the Hubert-Arabie adjusted Rand index. Psychol Methods 2004;9:386–96.
21 Fletcher S, Islam MZ. Comparing sets of patterns with the Jaccard index. Australasian Journal of Information Systems 2018;22.
22 Global Initiative for Chronic Obstructive Lung Disease. Pocket guide to COPD diagnosis, management and prevention, report, 2019. Available: https://goldcopd.org/wp-content/uploads/2018/11/GOLD-2019-POCKET-GUIDE-FINAL_WMS.pdf [Accessed 15 February 2020].
23 Breiman L, Friedman JH, Olshen RA, et al. Classification and regression trees. Belmont, CA: Wadsworth. International Group 1984;432:151–66.
24 Friedman JH. Stochastic gradient boosting. Comput Stat Data Anal 2002;38:367–78.
25 Papi A, Magnoni MS, Muzzio CC, et al. Phenomenology of COPD: interpreting phenotypes with the ECLIPSE study. Monaldi Arch Chest Dis 2016;83:721.
26 Clinical Practice Research Datalink (CPRD) and Optimum Patient Care Research Database (OPCRD). http://www.cprd.com/; https://opcrd.co.uk/
27 Nikolaou V, Massaro S, Fakhimi M, et al. COPD phenotypes and machine learning cluster analysis: a systematic review and future research agenda. Respir Med 2020;171:106093.
copyright. on October 29, 2021 by guest. Protected byhttp://bmjopenrespres.bmj.com/BMJ Open Resp Res: first published as 10.1136/bmjresp-2021-000980 on 29 October 2021. Downloaded from
... The models were constructed with four machine learning algorithms: Random Forest, gradient boosted machines (GBM), neural network with a single hidden layer, and support vector machines (SVM) with radial kernel. Selection of the optimal values of parameters controlling model behavior such as number of random trees, neurons in the hidden layer, or cost parameter was motivated by the maximum of Youden's J statistic (classification models of LFT abnormalities) or minimum mean absolute error (MAE, regression models of LFT readouts) in 10-repeats 10-fold cross-validation [15][16][17]. Because of the presence of participant-matched observation, blocked cross-validation design was used both in the model selection and model evaluation, with blocks defined by participant's identifier. ...
Article
Full-text available
Objectives: Prediction of lung function deficits following pulmonary infection is challenging and suffers from inaccuracy. We sought to develop machine-learning models for prediction of post-inflammatory lung changes based on COVID-19 recovery data. Methods: In the prospective CovILD study (n = 420 longitudinal observations from n = 140 COVID-19 survivors), data on lung function testing (LFT), chest CT including severity scoring by a human radiologist and density measurement by artificial intelligence, demography, and persistent symptoms were collected. This information was used to develop models of numeric readouts and abnormalities of LFT with four machine learning algorithms (Random Forest, gradient boosted machines, neural network, and support vector machines). Results: Reduced DLCO (diffusion capacity for carbon monoxide <80% of reference) was found in 94 (22%) observations. Those observations were modeled with a cross-validated accuracy of 82–85%, AUC of 0.87–0.9, and Cohen’s κ of 0.45–0.5. No reliable models could be established for FEV1 or FVC. For DLCO as a continuous variable, three machine learning algorithms yielded meaningful models with cross-validated mean absolute errors of 11.6–12.5% and R² of 0.26–0.34. CT-derived features such as opacity, high opacity, and CT severity score were among the most influential predictors of DLCO impairment. Conclusions: Multi-parameter machine learning trained with demographic, clinical, and artificial intelligence chest CT data reliably and reproducibly predicts LFT deficits and outperforms single markers of lung pathology and human radiologist’s assessment. It may improve diagnostic and foster personalized treatment.
... However, the association between PF, LAA scores, and clinical prognosis in COPD patients remains an area of ongoing research [9]. The LAA score is the primary method for evaluating respiratory tract and alveolar lesions in HRCT scans [10,11]. Specifically, in chest HRCT scans of COPD patients, it refers to the area of low-density lesion region that appears due to emphysema [12]. ...
Article
Full-text available
The objective of this study was to investigate the relationship between low attenuation area (LAA) scores, pulmonary function parameters, and clinical prognosis in patients with chronic obstructive pulmonary disease (COPD). COPD patients were divided into four LAA-based grades. Various lung function parameters were measured and correlated with LAA scores. Patient symptoms were examined using the St. George’s Respiratory Questionnaire (SGRQ) and exercise capacity using the 6-min walk test (6MWT). Statistical analysis determined the significance of differences. Higher levels of LAA were associated with decreased lung function and airflow limitations, suggesting a positive relationship between the two. Clinical symptom scores increased as COPD severity based on LAA stratification worsened. Reduced exercise capacity was shown by a substantial decline in 6MWT scores as COPD severity increased. As LAA scores increased, SGRQ scores increased, indicating a decreased quality of life (QOL). The study demonstrated a relationship between LAA scores and COPD severity. High LAA scores were associated with poor lung function, worse clinical symptoms, limited exercise capacity, and lower QOL. These findings show that LAA scores are clinically relevant for disease severity assessment and COPD management. Further research is required to determine LAA scores’ prognostic significance in disease progression and treatment response to enhance COPD therapy.
... Although we did not test other algorithms, notably decision tree models, we replicated findings of a recent study that did. Prior studies have attempted to phenotype subjects at risk of lung function decline using machine learning methods such as decision tree models in other cohorts, but did not adjust for decline associated with sex, which was the greatest predictor of decline (Nikolaou et al., 2021). As stated above, prediction of FEV 1 remains challenging, and continued refinement and optimization of our models will be required before these techniques can be applied clinically to account for lung function decline that may be fairly subtle over a few years. ...
Article
Full-text available
Purpose: The purpose of this study was to train and validate machine learning models for predicting rapid decline of forced expiratory volume in 1 s (FEV1) in individuals with a smoking history at-risk-for chronic obstructive pulmonary disease (COPD), Global Initiative for Chronic Obstructive Lung Disease (GOLD 0), or with mild-to-moderate (GOLD 1–2) COPD. We trained multiple models to predict rapid FEV1 decline using demographic, clinical and radiologic biomarker data. Training and internal validation data were obtained from the COPDGene study and prediction models were validated against the SPIROMICS cohort. Methods: We used GOLD 0–2 participants (n = 3,821) from COPDGene (60.0 ± 8.8 years, 49.9% male) for variable selection and model training. Accelerated lung function decline was defined as a mean drop in FEV1% predicted of > 1.5%/year at 5-year follow-up. We built logistic regression models predicting accelerated decline based on 22 chest CT imaging biomarker, pulmonary function, symptom, and demographic features. Models were validated using n = 885 SPIROMICS subjects (63.6 ± 8.6 years, 47.8% male). Results: The most important variables for predicting FEV1 decline in GOLD 0 participants were bronchodilator responsiveness (BDR), post bronchodilator FEV1% predicted (FEV1.pp.post), and CT-derived expiratory lung volume; among GOLD 1 and 2 subjects, they were BDR, age, and PRMlower lobes fSAD. In the validation cohort, GOLD 0 and GOLD 1–2 full variable models had significant predictive performance with AUCs of 0.620 ± 0.081 (p = 0.041) and 0.640 ± 0.059 (p < 0.001). Subjects with higher model-derived risk scores had significantly greater odds of FEV1 decline than those with lower scores. Conclusion: Predicting FEV1 decline in at-risk patients remains challenging but a combination of clinical, physiologic and imaging variables provided the best performance across two COPD cohorts.
... Five respiratory health conditions were identified in 12 studies [30,32,114,115,41, [107][108][109][110][111][112][113]. Chronic obstructive pulmonary disease (COPD) (J40) (n= 5) studies were prognostic predictive studies tackling mortality and hospitalization. ...
Preprint
Full-text available
Aim: With the rapid advances in technology and data science, machine learning (ML) is being adopted by the health care sector; but there is a lack of literature addressing the health conditions targeted by the ML prediction models within primary health care (PHC). To fill this gap in knowledge, we conducted a systematic review following the PRISMA guidelines to identify the health conditions targeted by ML in PHC. Methods: We searched the Cochrane Library, Web of Science, PubMed, Elsevier, BioRxiv, Association of Computing Machinery (ACM), and IEEE Xplore databases for studies published from January 1990 to January 2022. We included any primary study addressing ML diagnostic or prognostic predictive models that were supplied completely or partially by real-world PHC data. We performed literature screening, data extraction, and risk of bias assessment. Health conditions were categorized according to international classification of diseases. Extracted date were analyzed quantitatively and qualitatively. Results: We identified 109 studies investigating 42 health conditions. These studies included 273 ML prediction models supplied by the PHC data of 24.2 million participants from 19 countries. We found that 82% of the studies were retrospective. 76.6% of the studies reported diagnostic predictive ML models. 77% of all reported models aimed for models’ development without external validation. Risk of bias assessment revealed that 90.8% of the studies were of high or unclear risk of bias. The most frequently reported health conditions were Alzheimer’s disease and diabetes mellitus. Conclusions: To the best of our knowledge, this is the first review to investigate the extent of the health conditions targeted by the ML prediction models within PHC settings. Our study provides an important summary on the presently available ML models in PHC, which can be used in further research and implementation efforts.
Article
Full-text available
Background: Implantable cardioverter defibrillators (ICDs) reduce mortality associated with ventricular arrhythmia in high-risk patients with cardiovascular disease. Machine learning (ML) approaches are promising tools in arrhythmia research; however, their application in predicting ventricular arrhythmias in patients with ICDs remains unexplored. We aimed to predict and stratify ventricular arrhythmias requiring ICD therapy using 12-lead electrocardiograms (ECGs) in patients with an ICD. Methods and Results: This retrospective analysis included 200 adult patients who underwent ICD implantation at a single center. Patient demographics, clinical features, and 12-lead ECG data were collected. Unsupervised learning techniques, including K-means and hierarchical clustering, were used to stratify patients based on 12-lead ECG features. Dimensionality reduction methods were also used to optimize clustering accuracy. The silhouette coefficient was used to determine the optimal method and number of clusters. Of the 200 patients, 59 (29.5%) received appropriate therapy. The mean age of patients was 62.3 years, and 81.0% were male. The mean follow-up period was 2,953 days, with no significant intergroup differences. Hierarchical clustering into 3 clusters proved to be the most accurate (silhouette coefficient=0.585). Kaplan-Meier curves for these 3 clusters revealed significant differences (P=0.026). Conclusions: We highlight the potential of ML-based clustering using 12-lead ECGs to help in the risk stratification of ventricular arrhythmia. Future research in a larger multicenter setting may provide further insights and refine ICD indications.
Article
Full-text available
With the advances in technology and data science, machine learning (ML) is being rapidly adopted by the health care sector. However, there is a lack of literature addressing the health conditions targeted by the ML prediction models within primary health care (PHC) to date. To fill this gap in knowledge, we conducted a systematic review following the PRISMA guidelines to identify health conditions targeted by ML in PHC. We searched the Cochrane Library, Web of Science, PubMed, Elsevier, BioRxiv, Association of Computing Machinery (ACM), and IEEE Xplore databases for studies published from January 1990 to January 2022. We included primary studies addressing ML diagnostic or prognostic predictive models that were supplied completely or partially by real-world PHC data. Studies selection, data extraction, and risk of bias assessment using the prediction model study risk of bias assessment tool were performed by two investigators. Health conditions were categorized according to international classification of diseases (ICD-10). Extracted data were analyzed quantitatively. We identified 106 studies investigating 42 health conditions. These studies included 207 ML prediction models supplied by the PHC data of 24.2 million participants from 19 countries. We found that 92.4% of the studies were retrospective and 77.3% of the studies reported diagnostic predictive ML models. A majority (76.4%) of all the studies were for models’ development without conducting external validation. Risk of bias assessment revealed that 90.8% of the studies were of high or unclear risk of bias. The most frequently reported health conditions were diabetes mellitus (19.8%) and Alzheimer’s disease (11.3%). Our study provides a summary on the presently available ML prediction models within PHC. We draw the attention of digital health policy makers, ML models developer, and health care professionals for more future interdisciplinary research collaboration in this regard.
Article
Pseudomonas aeruginosa (P. aeruginosa) is a pathogen that persistently colonizes the respiratory tract of patients with chronic lung diseases. The risk of acquiring a chronic P. aeruginosa infection can be minimized by rapidly detecting the pathogen in the patient's airways and promptly administrating adequate antibiotics. However, the rapid detection of P. aeruginosa in the lungs involves the analysis of sputum, which is a highly complex matrix that is not always available. Here, we propose an alternative diagnosis based on analyzing breath aerosols. In this approach, nanoparticle immunosensors identify bacteria adhered to the polypropylene layer of a surgical facemask that was previously worn by the patient. A polypropylene processing protocol was optimized to ensure the efficient capture and analysis of the target pathogen. The proposed analytical platform has a theoretical limit of detection of 105 CFU mL-1 in aerosolized mock samples, and a dynamic range between 105 and 108 CFU mL-1. When tested with facemasks worn by patients, the biosensors were able to detect chronic and acute P. aeruginosa lung infections, and to differentiate them from respiratory infections caused by other pathogens. The results shown here pave the way to diagnose Pseudomonas infections at the bedside, as well as to identify the progress from chronic to acute infection.
Article
Background and objective: Obstructive airway diseases, including asthma and Chronic Obstructive Pulmonary Disease (COPD), are two of the most common chronic respiratory health problems. Both of these conditions require health professional expertise in making a diagnosis. Hence, this process is time intensive for healthcare providers and the diagnostic quality is subject to intra- and inter-operator variability. In this study we investigate the role of automated detection of obstructive airway diseases to reduce cost and improve diagnostic quality. Methods: We investigated the existing body of evidence and applied Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines to search records in IEEE, Google Scholar, and PubMed databases. We identified 65 papers that were published from 2013 to 2022 and these papers cover 67 different studies. The review process was structured according to the medical data that was used for disease detection. We identified six main categories, namely air flow, genetic, imaging, signals, and miscellaneous. For each of these categories, we report both disease detection methods and their performance. Results: We found that medical imaging was used in 14 of the reviewed studies as data for automated obstructive airway disease detection. Genetics and physiological signals were used in 13 studies. Medical records and air flow were used in 9 and 7 studies, respectively. Most papers were published in 2020 and we found three times more work on Machine Learning (ML) when compared to Deep Learning (DL). Statistical analysis shows that DL techniques achieve higher Accuracy (ACC) when compared to ML. Convolutional Neural Network (CNN) is the most common DL classifier and Support Vector Machine (SVM) is the most widely used ML classifier. During our review, we discovered only two publicly available asthma and COPD datasets.
Most studies used private clinical datasets, so data size and data composition are inconsistent. Conclusions: Our review results indicate that Artificial Intelligence (AI) can improve both decision quality and efficiency of health professionals during COPD and asthma diagnosis. However, we found several limitations in this review, such as a lack of dataset consistency, limited datasets, and insufficient exploration of remote monitoring. We appeal to society to accept and trust computer-aided diagnosis of obstructive airway diseases and we encourage health professionals to work closely with AI scientists to promote automated detection in clinical practice and hospital settings.
Article
Introduction Early identification of preventable risk factors of COPD progression is important. Whether exacerbations have a negative impact on disease progression is largely unknown. We investigated whether the long-term occurrence of exacerbations is associated with lung function decline at early stages of COPD. Methods Patients diagnosed with mild/moderate COPD (obstruction and FEV1% predicted 50–90%), aged ≥35 years, with a smoking history, who had ≥6 years of UK electronic medical records after initiation of maintenance therapy were studied. Multilevel mixed-effect linear regression was performed to determine the association between the number of years in which the patient had ≥1 exacerbation over a 6-year period and FEV1 decline, adjusted for sex, age, anthropometrics and smoking habits. Exacerbations were defined as any prescription for an acute oral corticosteroid course and/or lower respiratory-related antibiotics and/or any COPD-related emergency or inpatient hospitalization. Results Of 11,337 patients included (mean age 65 years; 49% female) 31.6%, 23.3%, 16.6%, 11.6%, 8.1%, 5.3% and 3.4% had 0, 1, 2, 3, 4, 5 and 6 years with ≥1 exacerbation. The mean annual FEV1 decline accelerated by 1.50 mL/year (95% Confidence Interval 1.02; 1.98) with every additional year with ≥1 exacerbation from 31.0 mL/year in subjects without any exacerbation to 40.0 mL/year in patients experiencing ≥1 exacerbation every year. Patients with more years with ≥1 exacerbation had a lower mean FEV1 at first diagnosis: 14.7 mL (11.7; 17.8) lower with every additional year with exacerbations. When counting years with ≥2 exacerbations, greater effects were observed (2.19 [1.50; 2.88] mL/year excess decline per year with ≥2 exacerbations; 16.5 mL [12.1; 20.8] lower FEV1 at diagnosis).
Conclusion Patients who experienced a greater exacerbation burden after initiation of maintenance therapy had worse lung function at diagnosis and a more rapid lung function decline thereafter, which emphasizes the need for better treatment strategies.
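The figures reported in this abstract can be checked with a few lines of arithmetic. The sketch below is only a linear reading of the published summary numbers (31.0 mL/year baseline, 1.50 mL/year excess per additional year with ≥1 exacerbation); it is not the fitted mixed-effects model, it ignores the adjustment covariates, and the function and constant names are hypothetical.

```python
# Summary figures taken from the abstract above; names are illustrative only.
BASELINE_DECLINE_ML_PER_YEAR = 31.0   # patients with no exacerbation years
EXCESS_PER_EXACERBATION_YEAR = 1.50   # extra decline per year with >=1 exacerbation
FOLLOW_UP_YEARS = 6                   # length of the observation window


def expected_annual_decline(years_with_exacerbation: int) -> float:
    """Linear approximation of mean annual FEV1 decline in mL/year."""
    if not 0 <= years_with_exacerbation <= FOLLOW_UP_YEARS:
        raise ValueError("years_with_exacerbation must be between 0 and 6")
    return (BASELINE_DECLINE_ML_PER_YEAR
            + EXCESS_PER_EXACERBATION_YEAR * years_with_exacerbation)


# Expected decline for every possible count of exacerbation years (0..6).
declines = {years: expected_annual_decline(years)
            for years in range(FOLLOW_UP_YEARS + 1)}

for years, decline in declines.items():
    print(f"{years} exacerbation year(s): {decline:.1f} mL/year")
```

Under this reading, six exacerbation years add 6 × 1.50 = 9.0 mL/year to the baseline, which matches the 40.0 mL/year quoted for patients who exacerbated every year.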
Article
Background Little is known about the impact of exacerbations on COPD progression or whether inhaled corticosteroid (ICS) use and blood eosinophil count (BEC) affect progression. We aimed to assess this in a prospective observational study. Methods The study population included patients with mild to moderate COPD, aged ≥35 years, with a smoking history, who were followed up for ≥3 years from first to last spirometry recording using two large UK electronic medical record databases: Clinical Practice Research Datalink (CPRD) and Optimum Patient Care Research Database (OPCRD). Multilevel mixed-effects linear regression models were used to determine the relationship between annual exacerbation rate following initiation of therapy (ICS vs non-ICS) and FEV1 decline. Effect modification by blood eosinophils was studied through interaction terms. Results Of 12178 patients included (mean age 66 years; 48% female), 8981 (74%) received ICS. In patients with BEC ≥350 cells/µL not on ICS, each exacerbation was associated with subsequent acceleration of FEV1 decline of 19.4 mL/year (95% CI 12.0 to 26.7, p<0.0001). This excess decline was reduced by 15.1 mL/year (6.6 to 23.6) to 4.3 mL/year (1.9 to 6.7, p<0.0001) in those with BEC ≥350 cells/µL treated with ICS. Conclusion Exacerbations are associated with a more rapid loss of lung function among COPD patients with elevated blood eosinophils, defined as ≥350 cells/µL, not treated with ICS. More aggressive prevention of exacerbations using ICS in such patients may prevent excess loss of lung function.
Article
Background: COPD is a highly heterogeneous disease composed of different phenotypes with different aetiological and prognostic profiles, and current classification systems do not fully capture this heterogeneity. In this study we sought to discover, describe and validate COPD subtypes using cluster analysis on data derived from electronic health records. Methods: We applied two unsupervised learning algorithms (k-means and hierarchical clustering) in 30,961 current and former smokers diagnosed with COPD, using linked national structured electronic health records in England available through the CALIBER resource. We used 15 clinical features, including risk factors and comorbidities, and performed dimensionality reduction using multiple correspondence analysis. We compared the association between cluster membership and COPD exacerbations and respiratory and cardiovascular death with 10,736 deaths recorded over 146,466 person-years of follow-up. We also implemented and tested a process to assign unseen patients into clusters using a decision tree classifier. Results: We identified and characterized five COPD patient clusters with distinct patient characteristics with respect to demographics, comorbidities, risk of death and exacerbations. Four of the clusters were associated with 1) anxiety/depression; 2) severe airflow obstruction and frailty; 3) cardiovascular disease and diabetes and 4) obesity/atopy. A fifth cluster was associated with low prevalence of most comorbid conditions. Conclusions: COPD patients can be sub-classified into groups with differing risk factors, comorbidities, and prognosis, based on data included in their primary care records. The identified clusters confirm findings of previous clustering studies and draw attention to anxiety and depression as important drivers of the disease in young, female patients.
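The general shape of the cluster-then-assign workflow described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the study's pipeline: the points are made-up two-dimensional coordinates standing in for patients after dimensionality reduction, the starting centroids are fixed by hand, and unseen patients are assigned by nearest centroid rather than by the decision tree classifier the abstract mentions.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def nearest(point, centroids):
    """Index of the centroid closest to the given point."""
    return min(range(len(centroids)),
               key=lambda i: squared_distance(point, centroids[i]))


def k_means(points, initial_centroids, max_iter=100):
    """Lloyd's algorithm: alternate assignment and centroid update."""
    centroids = [tuple(c) for c in initial_centroids]
    for _ in range(max_iter):
        labels = [nearest(p, centroids) for p in points]
        updated = []
        for i, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:
                # New centroid is the coordinate-wise mean of its members.
                updated.append(tuple(sum(dim) / len(members)
                                     for dim in zip(*members)))
            else:
                updated.append(old)  # keep an empty cluster's centroid in place
        if updated == centroids:
            break  # assignments are stable, so the algorithm has converged
        centroids = updated
    labels = [nearest(p, centroids) for p in points]
    return centroids, labels


# Hypothetical reduced coordinates for six patients (two obvious groups).
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids, labels = k_means(points, initial_centroids=[points[0], points[3]])

# Assign a previously unseen patient to the closest existing cluster.
unseen_patient = (9.0, 9.5)
assigned_cluster = nearest(unseen_patient, centroids)
print(labels, assigned_cluster)
```

Real analyses would typically choose the number of clusters with a validity index and use several random restarts; both are left out here to keep the sketch short.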
Article
The ability to extract knowledge from data has been the driving force of Data Mining since its inception, and of statistical modeling long before even that. Actionable knowledge often takes the form of patterns, where a set of antecedents can be used to infer a consequent. In this paper we offer a solution to the problem of comparing different sets of patterns. Our solution allows comparisons between sets of patterns that were derived from different techniques (such as different classification algorithms), or made from different samples of data (such as temporal data or data perturbed for privacy reasons). We propose using the Jaccard index to measure the similarity between sets of patterns by converting each pattern into a single element within the set. Our measure focuses on providing conceptual simplicity, computational simplicity, interpretability, and wide applicability. The results of this measure are compared to prediction accuracy in the context of a real-world data mining scenario.
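The measure described above reduces to the standard Jaccard index, |A ∩ B| / |A ∪ B|, once each pattern is treated as one set element. The sketch below illustrates that idea; the rule encoding (a frozenset of antecedents paired with a consequent), the example rules, and the convention for two empty sets are assumptions made for illustration and may differ from the paper's own conversion.

```python
def as_element(antecedents, consequent):
    """Encode one pattern as a single hashable element.

    Antecedent order is ignored, so the same rule written in a
    different order maps to the same element.
    """
    return (frozenset(antecedents), consequent)


def jaccard_index(patterns_a, patterns_b):
    """Return |A intersect B| / |A union B| for two collections of patterns."""
    set_a, set_b = set(patterns_a), set(patterns_b)
    union = set_a | set_b
    if not union:
        return 1.0  # convention chosen here: two empty sets count as identical
    return len(set_a & set_b) / len(union)


# Hypothetical rule sets produced by two different classifiers.
rules_model_1 = {
    as_element(["age>65", "smoker"], "high_risk"),
    as_element(["age<=65"], "low_risk"),
    as_element(["smoker", "bmi>30"], "high_risk"),
}
rules_model_2 = {
    as_element(["smoker", "age>65"], "high_risk"),  # same rule, reordered
    as_element(["age<=65"], "low_risk"),
    as_element(["bmi>30"], "high_risk"),
}

similarity = jaccard_index(rules_model_1, rules_model_2)
print(similarity)
```

Here two of the four distinct patterns are shared, so the similarity is 2 / 4 = 0.5. Exact matching is deliberately strict: rules that differ in a single antecedent count as entirely different elements.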
Article
The Evaluation of COPD Longitudinally to Identify Predictive Surrogate End-points (ECLIPSE) study was a large 3-year observational multicentre international study aimed at defining COPD phenotypes and identifying biomarkers and/or genetic parameters that help to predict disease progression. The study has contributed to a better understanding of COPD heterogeneity, with the characterization of clinically important subtypes/phenotypes of patients, such as frequent exacerbators or patients with persistent systemic inflammation, who may have different prognoses or treatment requirements. Because of the large amount of information that is starting to be produced from metabolomic, proteomic and genomic approaches, one of the biggest challenges is the integration of data in a biological perspective such as clinical prognosis and response to medicinal products. In this article we highlight some of the progress in phenotyping the heterogeneity of the disease that has been made thanks to the analyses of this longitudinal study.
Article
Purpose The Royal College of General Practitioners Research and Surveillance Centre (RCGP RSC) is one of the longest established primary care sentinel networks. In 2015, it established a new data and analysis hub at the University of Surrey. This paper evaluates the representativeness of the RCGP RSC network against the English population. Participants and method The cohort includes 1 042 063 patients registered in 107 participating general practitioner (GP) practices. We compared the RCGP RSC data with English national data in the following areas: demographics; geographical distribution; chronic disease prevalence, management and completeness of data recording; and prescribing and vaccine uptake. We also assessed practices within the network participating in a national swabbing programme. Findings to date We found a small over-representation of people in the 25–44 age band, under-representation of white ethnicity, and of less deprived people. Geographical focus is in London, with fewer practices in the southwest and east of England. We found differences in the prevalence of diabetes (national: 6.4%, RCGP RSC: 5.8%), learning disabilities (national: 0.44%, RCGP RSC: 0.40%), obesity (national: 9.2%, RCGP RSC: 8.0%), pulmonary disease (national: 1.8%, RCGP RSC: 1.6%), and cardiovascular diseases (national: 1.1%, RCGP RSC: 1.2%). Data completeness in risk factors for diabetic population is high (77–99%). We found differences in prescribing rates and costs for infections (national: 5.58%, RCGP RSC: 7.12%), and for nutrition and blood conditions (national: 6.26%, RCGP RSC: 4.50%). Differences in vaccine uptake were seen in patients aged 2 years (national: 38.5%, RCGP RSC: 32.8%). Owing to large numbers, most differences were significant (p<0.00015). Future plans The RCGP RSC is a representative network, having only small differences with the national population, which have now been quantified and can be assessed for clinical relevance for specific studies.
This network is a rich source for research into routine practice.
Article
Chronic Obstructive Pulmonary Disease (COPD) is a highly heterogeneous condition projected to become the third leading cause of death worldwide by 2030. To better characterize this condition, clinicians have classified patients sharing certain symptomatic characteristics, such as symptom intensity and history of exacerbations, into distinct phenotypes. In recent years, the growing use of machine learning algorithms, and cluster analysis in particular, has promised to advance this classification through the integration of additional patient characteristics, including comorbidities, biomarkers, and genomic information. This combination would allow researchers to more reliably identify new COPD phenotypes, as well as better characterize existing ones, with the aim of improving diagnosis and developing novel treatments. Here, we systematically review the last decade of research progress, which uses cluster analysis to identify COPD phenotypes. Collectively, we provide a systematized account of the extant evidence, describe the strengths and weaknesses of the main methods used, identify gaps in the literature, and suggest recommendations for future research.
Chapter
Principal components analysis (PCA) is a commonly used descriptive multivariate method for handling quantitative data and can be extended to deal with mixed measurement level data. For the extended PCA with such a mixture of quantitative and qualitative data, we require the quantification of qualitative data in order to obtain optimal scaling data. PCA with optimal scaling is referred to as nonlinear PCA (Gifi, Nonlinear Multivariate Analysis. Wiley, Chichester, 1990). Nonlinear PCA including optimal scaling alternates between estimating the parameters of PCA and quantifying qualitative data. The alternating least squares (ALS) algorithm is used for nonlinear PCA and can find least squares solutions by minimizing two types of loss functions: a low-rank approximation and homogeneity analysis with restrictions. PRINCIPALS of Young et al. (Principal components of mixed measurement level multivariate data: an alternating least squares method with optimal scaling features 43:279–281, 1978) and PRINCALS of Gifi (Nonlinear Multivariate Analysis. Wiley, Chichester, 1990) are used for the computation.